DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.

Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:

  • How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
  • How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
  • How to focus volunteer time on the applications that need the most assistance

The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

DonorsChoose

About the DonorsChoose Data Set

The train.csv data set provided by DonorsChoose contains the following features:

Feature Description
project_id A unique identifier for the proposed project. Example: p036502
project_title Title of the project. Examples:
  • Art Will Make You Happy!
  • First Grade Fun
project_grade_category Grade level of students for which the project is targeted. One of the following enumerated values:
  • Grades PreK-2
  • Grades 3-5
  • Grades 6-8
  • Grades 9-12
project_subject_categories One or more (comma-separated) subject categories for the project from the following enumerated list of values:
  • Applied Learning
  • Care & Hunger
  • Health & Sports
  • History & Civics
  • Literacy & Language
  • Math & Science
  • Music & The Arts
  • Special Needs
  • Warmth

Examples:
  • Music & The Arts
  • Literacy & Language, Math & Science
school_state State where school is located (Two-letter U.S. postal code). Example: WY
project_subject_subcategories One or more (comma-separated) subject subcategories for the project. Examples:
  • Literacy
  • Literature & Writing, Social Sciences
project_resource_summary An explanation of the resources needed for the project. Example:
  • My students need hands on literacy materials to manage sensory needs!
project_essay_1 First application essay*
project_essay_2 Second application essay*
project_essay_3 Third application essay*
project_essay_4 Fourth application essay*
project_submitted_datetime Datetime when project application was submitted. Example: 2016-04-28 12:43:56.245
teacher_id A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56
teacher_prefix Teacher's title. One of the following enumerated values:
  • nan
  • Dr.
  • Mr.
  • Mrs.
  • Ms.
  • Teacher.
teacher_number_of_previously_posted_projects Number of project applications previously submitted by the same teacher. Example: 2

* See the section Notes on the Essay Data for more details about these features.

Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:

Feature Description
id A project_id value from the train.csv file. Example: p036502
description Desciption of the resource. Example: Tenor Saxophone Reeds, Box of 25
quantity Quantity of the resource required. Example: 3
price Price of the resource required. Example: 9.95

Note: Many projects require multiple resources. The id value corresponds to a project_id in train.csv, so you use it as a key to retrieve all resources needed for a project:

The data set contains the following label (the value you will attempt to predict):

Label Description
project_is_approved A binary flag indicating whether DonorsChoose approved the project. A value of 0 indicates the project was not approved, and a value of 1 indicates the project was approved.

Notes on the Essay Data

    Prior to May 17, 2016, the prompts for the essays were as follows:
  • __project_essay_1:__ "Introduce us to your classroom"
  • __project_essay_2:__ "Tell us more about your students"
  • __project_essay_3:__ "Describe how your students will use the materials you're requesting"
  • __project_essay_3:__ "Close by sharing why your project will make a difference"
    Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:
  • __project_essay_1:__ "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."
  • __project_essay_2:__ "About your project: How will these materials make a difference in your students' learning and improve their school lives?"

  • For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be NaN.
In [0]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import sqlite3
import pandas as pd
import numpy as np
import nltk
import math
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

from plotly import plotly
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

import dill #To store session variables
#https://stackoverflow.com/questions/34342155/how-to-pickle-or-store-jupyter-ipython-notebook-session-for-later

1.1 Reading Data

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive
In [0]:
ls "drive/My Drive/Colab Notebooks"
'06 Implement SGD.ipynb'          Db-IMDB.db
 3_DonorsChoose_KNN_final.ipynb   glove.6B.50d.txt
 4_DonorsChoose_NB_final.ipynb    glove_vectors_300d
 5_DonorsChoose_LR_final.ipynb    glove_vectors_50
 7_DonorsChoose_SVM_final.ipynb   knn.sess
 7_DonorsChoose_SVM.ipynb         resources.csv
 8_DonorsChoose_DT_final.ipynb   'SQL Assignment.ipynb'
 9_DonorsChoose_RF_final.ipynb    train_data.csv
In [0]:
project_data = pd.read_csv('drive/My Drive/Colab Notebooks/train_data.csv')
resource_data = pd.read_csv('drive/My Drive/Colab Notebooks/resources.csv')
In [0]:
project_data_1=project_data[project_data['project_is_approved']==1]
project_data_0=project_data[project_data['project_is_approved']==0]

print(project_data_1.shape)
print(project_data_0.shape)

#Creating a dataset of 0.2k points containg points from both the classes
project_data = project_data_1[0:33458].append(project_data_0[0:16542])
print(project_data['project_is_approved'].value_counts())
print(project_data.shape)
(92706, 17)
(16542, 17)
1    33458
0    16542
Name: project_is_approved, dtype: int64
(50000, 17)
In [0]:
print("Number of data points in train data", project_data.shape)
print('-'*50)
print("The attributes of data :", project_data.columns.values)
Number of data points in train data (50000, 17)
--------------------------------------------------
The attributes of data : ['Unnamed: 0' 'id' 'teacher_id' 'teacher_prefix' 'school_state'
 'project_submitted_datetime' 'project_grade_category'
 'project_subject_categories' 'project_subject_subcategories'
 'project_title' 'project_essay_1' 'project_essay_2' 'project_essay_3'
 'project_essay_4' 'project_resource_summary'
 'teacher_number_of_previously_posted_projects' 'project_is_approved']
In [0]:
# how to replace elements in list python: https://stackoverflow.com/a/2582163/4084039
cols = ['Date' if x=='project_submitted_datetime' else x for x in list(project_data.columns)]

#sort dataframe based on time pandas python: https://stackoverflow.com/a/49702492/4084039
project_data['Date'] = pd.to_datetime(project_data['project_submitted_datetime'])
project_data.drop('project_submitted_datetime', axis=1, inplace=True)
project_data.sort_values(by=['Date'], inplace=True)

# how to reorder columns pandas python: https://stackoverflow.com/a/13148611/4084039
project_data = project_data[cols]

project_data.head(2)
Out[0]:
Unnamed: 0 id teacher_id teacher_prefix school_state Date project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved
473 100660 p234804 cbc0e38f522143b86d372f8b43d4cff3 Mrs. GA 2016-04-27 00:53:00 Grades PreK-2 Applied Learning Early Development Flexible Seating for Flexible Learning I recently read an article about giving studen... I teach at a low-income (Title 1) school. Ever... We need a classroom rug that we can use as a c... Benjamin Franklin once said, \"Tell me and I f... My students need flexible seating in the class... 2 1
29891 146723 p099708 c0a28c79fe8ad5810da49de47b3fb491 Mrs. CA 2016-04-27 01:10:09 Grades 3-5 Math & Science, History & Civics Mathematics, Social Sciences Breakout Box to Ignite Engagement! It's the end of the school year. Routines have... My students desire challenges, movement, and c... I will design different clues using specific c... Donations to this project will immediately imp... My students need items from a \"Breakout Box\"... 6 1
In [0]:
print("Number of data points in train data", resource_data.shape)
print(resource_data.columns.values)
resource_data.head(2)
Number of data points in train data (1541272, 4)
['id' 'description' 'quantity' 'price']
Out[0]:
id description quantity price
0 p233245 LC652 - Lakeshore Double-Space Mobile Drying Rack 1 149.00
1 p069063 Bouncy Bands for Desks (Blue support pipes) 3 14.95

1.2 preprocessing of project_subject_categories

In [0]:
categories = list(project_data['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in categories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # it will split it in three parts ["Math & Science", "Warmth", "Care & Hunger"]
        if 'The' in j.split(): # this will split each of the catogory based on space "Math & Science"=> "Math","&", "Science"
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp+=j.strip()+" " #" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_') # we are replacing the & value into 
    cat_list.append(temp.strip())
    
project_data['clean_categories'] = cat_list
project_data.drop(['project_subject_categories'], axis=1, inplace=True)

from collections import Counter
my_counter = Counter()
for word in project_data['clean_categories'].values:
    my_counter.update(word.split())

cat_dict = dict(my_counter)
sorted_cat_dict = dict(sorted(cat_dict.items(), key=lambda kv: kv[1]))

1.3 preprocessing of project_subject_subcategories

In [0]:
sub_catogories = list(project_data['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

sub_cat_list = []
for i in sub_catogories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # it will split it in three parts ["Math & Science", "Warmth", "Care & Hunger"]
        if 'The' in j.split(): # this will split each of the catogory based on space "Math & Science"=> "Math","&", "Science"
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    sub_cat_list.append(temp.strip())

project_data['clean_subcategories'] = sub_cat_list
project_data.drop(['project_subject_subcategories'], axis=1, inplace=True)

# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
my_counter = Counter()
for word in project_data['clean_subcategories'].values:
    my_counter.update(word.split())
    
sub_cat_dict = dict(my_counter)
sorted_sub_cat_dict = dict(sorted(sub_cat_dict.items(), key=lambda kv: kv[1]))

1.3 Text preprocessing

In [0]:
# merge two column text dataframe: 
project_data["essay"] = project_data["project_essay_1"].map(str) +\
                        project_data["project_essay_2"].map(str) + \
                        project_data["project_essay_3"].map(str) + \
                        project_data["project_essay_4"].map(str)
In [0]:
project_data.head(2)
Out[0]:
Unnamed: 0 id teacher_id teacher_prefix school_state Date project_grade_category project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved clean_categories clean_subcategories essay
473 100660 p234804 cbc0e38f522143b86d372f8b43d4cff3 Mrs. GA 2016-04-27 00:53:00 Grades PreK-2 Flexible Seating for Flexible Learning I recently read an article about giving studen... I teach at a low-income (Title 1) school. Ever... We need a classroom rug that we can use as a c... Benjamin Franklin once said, \"Tell me and I f... My students need flexible seating in the class... 2 1 AppliedLearning EarlyDevelopment I recently read an article about giving studen...
29891 146723 p099708 c0a28c79fe8ad5810da49de47b3fb491 Mrs. CA 2016-04-27 01:10:09 Grades 3-5 Breakout Box to Ignite Engagement! It's the end of the school year. Routines have... My students desire challenges, movement, and c... I will design different clues using specific c... Donations to this project will immediately imp... My students need items from a \"Breakout Box\"... 6 1 Math_Science History_Civics Mathematics SocialSciences It's the end of the school year. Routines have...
In [0]:
# printing some random reviews
print(project_data['essay'].values[0])
print("="*50)
print(project_data['essay'].values[150])
print("="*50)
print(project_data['essay'].values[1000])
I recently read an article about giving students a choice about how they learn. We already set goals; why not let them choose where to sit, and give them options of what to sit on?I teach at a low-income (Title 1) school. Every year, I have a class with a range of abilities, yet they are all the same age. They learn differently, and they have different interests. Some have ADHD, and some are fast learners. Yet they are eager and active learners that want and need to be able to move around the room, yet have a place that they can be comfortable to complete their work.We need a classroom rug that we can use as a class for reading time, and students can use during other learning times. I have also requested four Kore Kids wobble chairs and four Back Jack padded portable chairs so that students can still move during whole group lessons without disrupting the class. Having these areas will provide these little ones with a way to wiggle while working.Benjamin Franklin once said, \"Tell me and I forget, teach me and I may remember, involve me and I learn.\" I want these children to be involved in their learning by having a choice on where to sit and how to learn, all by giving them options for comfortable flexible seating.
==================================================
A unit that has captivated my students and one that has forced them to seek out further resources on their own, is the Holocaust unit. This unit not only brought their critical thinking skills to life, but it brought out their passion, love, dislikes, and fears about wars and prejudices to light.My 8th graders students live in a high-poverty school district and live in a large, urban area. They are reluctant readers unless introduced to life-changing books. This book made my students work hard in improving their reading and writing skills. The Holocaust unit brought compassion and history to life. The students wanted to read ahead and learn about tolerance and discrimination.These materials will be used in-class. We were read, discuss, and think critically about the world event that still affects us. The Holocaust is part of our history and its victims and survivors deserve our knowledge and recognition of the hardships they endured. We will be researching the victims and survivors of the Holocaust, read non-fictional text, watch documentaries, and overall broaden our education on this historic event.This project will greatly benefit my students. It will not only help them academically and help prepare them for high school, but it will make them well-rounded individuals who better understand the power of tolerance and war. Please know that you have made a positive impact on my students and we sincerely thank you in advance.
==================================================
Why learn coding in the 5th grade? I teach science through STEM. Instead of using only spaghetti and marshmallows for engineering, I want the students to use coding. It is time to use interactive approaches to solving problems and testing ideas using real-life skills students may use in the future.My school is located in Jupiter, Florida, and we are an intermediate center, servicing only 3rd-5th grades. I teach 3 classes of science to 5th grade students. My students are a mix of gifted and advanced 10 and 11 year olds, of at which 20% have some type of learning challenge, such as ADD or autism. They all have insatiable thirsts for science. Most come to me with limited knowledge of science, but a tremendous understanding of technology. Most have a computer in their home and are familiar with tablets and smartphones. At least 1/3 of my students know Scratch and JavaScript programming.\r\nMy goal is to pair my students incredible knowledge of technology with science concepts to deepen their understandings of that concept. I also want to expose all of my students with coding since research has shown that more computer coders will be needed for future jobs than ever before.\r\nWhat I envision is the students working in groups using the specific coding device, Raspberry Pi, to create codes to manipulate the sensors. These will be attached to laptops at each table.  In the beginning, I will use the device to teach basic coding to solve a problem. The students will be required to learn how to set up the motherboard during this process. Then I will move on to using it with my science content. One activity I found intriguing is the weather station sensors. The students work together to find a way to code for each of these sensors to turn on and off and collect, store, and manipulate the data. This will become a part of my weather unit.By pairing this type of technology with science, I feel my lesson then is reflecting how science works in the real world. Technology and science go hand in hand and I want my students to experience that one influences the other. I want them to experience that scientists use technology as a tool to further deepen their understanding of concepts. I also want both my boys and girls to learn and understanding coding as a viable future career.
In [0]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
In [0]:
sent = decontracted(project_data['essay'].values[2000])
print(sent)
print("="*50)
My school is in a low socio-economic area with a high ELL population. The students in my classroom do not have a lot of academic practice outside of the school day. They love coming to school everyday and are eager to learn. They work very hard and are so excited when they master new concepts.  \r\n   At my school site we strive to make the most of every minute during the school day in order to ensure students are able to learn and feel successful. We know that the time we have with them is very precious!I am asking for the mini white boards and reusable write and wipe pockets in order to help me monitor my students thinking and learning. Often times, when work is done on worksheets the feedback to students is not meaningful because it can take awhile to give each student individual feed back. The white boards and write and wipe pockets will give students a way to show written responses while we are gathered at the carpet together. This will allow me to give immediate feedback to students and then can modify their responses right then and there. This will lead to more meaningful learning and processing.nannan
==================================================
In [0]:
# \r \n \t remove from string python: http://texthandler.com/info/remove-line-breaks-python/
sent = sent.replace('\\r', ' ')
sent = sent.replace('\\"', ' ')
sent = sent.replace('\\n', ' ')
print(sent)
My school is in a low socio-economic area with a high ELL population. The students in my classroom do not have a lot of academic practice outside of the school day. They love coming to school everyday and are eager to learn. They work very hard and are so excited when they master new concepts.       At my school site we strive to make the most of every minute during the school day in order to ensure students are able to learn and feel successful. We know that the time we have with them is very precious!I am asking for the mini white boards and reusable write and wipe pockets in order to help me monitor my students thinking and learning. Often times, when work is done on worksheets the feedback to students is not meaningful because it can take awhile to give each student individual feed back. The white boards and write and wipe pockets will give students a way to show written responses while we are gathered at the carpet together. This will allow me to give immediate feedback to students and then can modify their responses right then and there. This will lead to more meaningful learning and processing.nannan
In [0]:
#remove spacial character: https://stackoverflow.com/a/5843547/4084039
sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
print(sent)
My school is in a low socio economic area with a high ELL population The students in my classroom do not have a lot of academic practice outside of the school day They love coming to school everyday and are eager to learn They work very hard and are so excited when they master new concepts At my school site we strive to make the most of every minute during the school day in order to ensure students are able to learn and feel successful We know that the time we have with them is very precious I am asking for the mini white boards and reusable write and wipe pockets in order to help me monitor my students thinking and learning Often times when work is done on worksheets the feedback to students is not meaningful because it can take awhile to give each student individual feed back The white boards and write and wipe pockets will give students a way to show written responses while we are gathered at the carpet together This will allow me to give immediate feedback to students and then can modify their responses right then and there This will lead to more meaningful learning and processing nannan
In [0]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]
In [0]:
# Combining all the above stundents 
from tqdm import tqdm
preprocessed_essays = []
# tqdm is for printing the status bar
for sentance in tqdm(project_data['essay'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    preprocessed_essays.append(sent.lower().strip())
100%|██████████| 50000/50000 [00:30<00:00, 1628.88it/s]
In [0]:
#adding a new column for the processed essay text
project_data['clean_essay']=preprocessed_essays
print(project_data.columns)

# after preprocesing
preprocessed_essays[2000]
Index(['Unnamed: 0', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'Date', 'project_grade_category', 'project_title', 'project_essay_1',
       'project_essay_2', 'project_essay_3', 'project_essay_4',
       'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'clean_categories', 'clean_subcategories', 'essay', 'clean_essay'],
      dtype='object')
Out[0]:
'school low socio economic area high ell population students classroom not lot academic practice outside school day love coming school everyday eager learn work hard excited master new concepts school site strive make every minute school day order ensure students able learn feel successful know time precious asking mini white boards reusable write wipe pockets order help monitor students thinking learning often times work done worksheets feedback students not meaningful take awhile give student individual feed back white boards write wipe pockets give students way show written responses gathered carpet together allow give immediate feedback students modify responses right lead meaningful learning processing nannan'

1.4.1 Preprocessing of `project_title`

In [0]:
project_data.head(2)
Out[0]:
Unnamed: 0 id teacher_id teacher_prefix school_state Date project_grade_category project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved clean_categories clean_subcategories essay clean_essay
473 100660 p234804 cbc0e38f522143b86d372f8b43d4cff3 Mrs. GA 2016-04-27 00:53:00 Grades PreK-2 Flexible Seating for Flexible Learning I recently read an article about giving studen... I teach at a low-income (Title 1) school. Ever... We need a classroom rug that we can use as a c... Benjamin Franklin once said, \"Tell me and I f... My students need flexible seating in the class... 2 1 AppliedLearning EarlyDevelopment I recently read an article about giving studen... recently read article giving students choice l...
29891 146723 p099708 c0a28c79fe8ad5810da49de47b3fb491 Mrs. CA 2016-04-27 01:10:09 Grades 3-5 Breakout Box to Ignite Engagement! It's the end of the school year. Routines have... My students desire challenges, movement, and c... I will design different clues using specific c... Donations to this project will immediately imp... My students need items from a \"Breakout Box\"... 6 1 Math_Science History_Civics Mathematics SocialSciences It's the end of the school year. Routines have... end school year routines run course students n...
In [0]:
#Printing a few random review summaries

for i in range(1,3000,1000):
    sent = project_data['project_title'].values[i]
    print(sent,'--- Row No:',i)
    print("="*50)
Breakout Box to Ignite Engagement! --- Row No: 1
==================================================
Cozy Classroom Carpet for Learning --- Row No: 1001
==================================================
Community Circle Carpet: A Place to Call Home! --- Row No: 2001
==================================================
In [0]:
# The above random records show that there are no URLs or HTML tags, but we will remove incase if there are any

from tqdm import tqdm #for status bar
from bs4 import BeautifulSoup #for html tags

preprocessed_title=[]

for title in tqdm(project_data['project_title'].values):
    # To remove urls - https://stackoverflow.com/a/40823105/4084039
    title = re.sub(r"http\S+", "", title)
    
    # To remove all HTML tags
    #https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
    title = BeautifulSoup(title, 'lxml').get_text()
    
    # To split contractions - refer decontracted function defined above
    title = decontracted(title)
    
    # To remove alphanumerics (words with numbers in them) - https://stackoverflow.com/a/18082370/4084039
    title = re.sub("\S*\d\S*", "", title).strip()
    
    # To remove special characters - https://stackoverflow.com/a/5843547/4084039
    title = re.sub('[^A-Za-z]+', ' ', title)
    
    # To remove stop words from the summaries and convert to lowercase
    title = ' '.join(e.lower() for e in title.split() if e.lower() not in stopwords)
    preprocessed_title.append(title.strip())

#adding a new column for cleaned titles
project_data['clean_title']=preprocessed_title
print(project_data.columns)
100%|██████████| 50000/50000 [00:16<00:00, 2948.21it/s]
Index(['Unnamed: 0', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'Date', 'project_grade_category', 'project_title', 'project_essay_1',
       'project_essay_2', 'project_essay_3', 'project_essay_4',
       'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'clean_categories', 'clean_subcategories', 'essay', 'clean_essay',
       'clean_title'],
      dtype='object')

1.4.2 Preprocessing of `teacher_prefix`

In [0]:
#replacing Nan values with 'Unknown'
project_data['teacher_prefix']=project_data['teacher_prefix'].replace(np.nan,'Unknown')

1.4.3 Combining resource_data with project_data

In [0]:
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
project_data = pd.merge(project_data, price_data, on='id', how='left')

1.4.4 Adding word counts for Title and Essay

In [0]:
#https://stackoverflow.com/questions/54397096/how-to-do-word-count-on-pandas-dataframe

project_data['title_wc'] = project_data['clean_title'].str.count(' ')+1

project_data['essay_wc'] = project_data['clean_essay'].str.count(' ')+1

project_data.columns
Out[0]:
Index(['Unnamed: 0', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'Date', 'project_grade_category', 'project_title', 'project_essay_1',
       'project_essay_2', 'project_essay_3', 'project_essay_4',
       'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'clean_categories', 'clean_subcategories', 'essay', 'clean_essay',
       'clean_title', 'price', 'quantity', 'title_wc', 'essay_wc'],
      dtype='object')

1.4.5 Adding sentiment scores for each essay

In [0]:
#http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

project_data['senti_score'] = 0
project_data['senti_score'] = project_data['senti_score'].astype(float)

anlyzr = SentimentIntensityAnalyzer()

for index in project_data.index:
  project_data.at[index, 'senti_score'] = anlyzr.polarity_scores(project_data.at[index,'clean_essay'])['compound']
  
print(project_data.columns)
/usr/local/lib/python3.6/dist-packages/nltk/twitter/__init__.py:20: UserWarning:

The twython library has not been installed. Some functionality from the twitter package will not be available.

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
Index(['Unnamed: 0', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'Date', 'project_grade_category', 'project_title', 'project_essay_1',
       'project_essay_2', 'project_essay_3', 'project_essay_4',
       'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'clean_categories', 'clean_subcategories', 'essay', 'clean_essay',
       'clean_title', 'price', 'quantity', 'title_wc', 'essay_wc',
       'senti_score'],
      dtype='object')

1.5 Preparing data for models

In [0]:
project_data.columns
Out[0]:
Index(['Unnamed: 0', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'Date', 'project_grade_category', 'project_title', 'project_essay_1',
       'project_essay_2', 'project_essay_3', 'project_essay_4',
       'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'clean_categories', 'clean_subcategories', 'essay', 'clean_essay',
       'clean_title', 'price', 'quantity', 'title_wc', 'essay_wc',
       'senti_score'],
      dtype='object')

we are going to consider

   - school_state : categorical data
   - clean_categories : categorical data
   - clean_subcategories : categorical data
   - project_grade_category : categorical data
   - teacher_prefix : categorical data

   - project_title : text data
   - text : text data
   - project_resource_summary: text data (optinal)

   - quantity : numerical (optinal)
   - teacher_number_of_previously_posted_projects : numerical
   - price : numerical

2. Random Forest and GBDT

2.1 Splitting data into Train and cross validation(or test): Stratified Sampling

In [0]:
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

#Checking if there are any values other than 0 and 1
project_data['project_is_approved'].unique()

#https://answers.dataiku.com/2352/split-dataset-by-stratified-sampling
df_train, df_test = train_test_split(project_data, test_size = 0.3, stratify=project_data['project_is_approved'])
print(df_train.shape,df_test.shape)
(35000, 25) (15000, 25)

2.2 Make Data Model Ready: encoding numerical, categorical features

2.2.1 Vectorizing Categorical data using class probabilities (Response Coding)

In [0]:
print(df_train.columns)
Index(['Unnamed: 0', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'Date', 'project_grade_category', 'project_title', 'project_essay_1',
       'project_essay_2', 'project_essay_3', 'project_essay_4',
       'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'clean_categories', 'clean_subcategories', 'essay', 'clean_essay',
       'clean_title', 'price', 'quantity', 'title_wc', 'essay_wc',
       'senti_score'],
      dtype='object')

2.2.1.1 Feature encoding for categories

In [0]:
#https://stackoverflow.com/questions/3839729/count-unique-values-with-pandas-per-groups

# Fetching unique value counts for each class
clean_cat_count = pd.DataFrame()
clean_cat_count[1] = df_train['clean_categories'].where(df_train['project_is_approved']==1).value_counts()
clean_cat_count[0] = df_train['clean_categories'].where(df_train['project_is_approved']==0).value_counts()

#Replacing nan value counts with zeros
clean_cat_count[1]=clean_cat_count[1].replace(np.nan,0)
clean_cat_count[0]=clean_cat_count[0].replace(np.nan,0)

#print(clean_cat_count)

#Calculating probs for each class
for i in clean_cat_count.iterrows():
  clean_cat_count['1_prob'] = clean_cat_count[1]/(clean_cat_count[1]+clean_cat_count[0])
  clean_cat_count['0_prob'] = clean_cat_count[0]/(clean_cat_count[1]+clean_cat_count[0])
  
#print(clean_cat_count)

#appending prob values to train data in a new column
  
for idx,j in clean_cat_count.iterrows():
  for indx,i in df_train.iterrows():
    if idx == df_train.at[indx, 'clean_categories']:
      df_train.at[indx, 'cat_1'] = clean_cat_count.at[idx, '1_prob']
      df_train.at[indx, 'cat_0'] = clean_cat_count.at[idx, '0_prob']
      
print(df_train.head(2))
       Unnamed: 0       id  ...     cat_1     cat_0
47021       44946  p007627  ...  0.622578  0.377422
48842       26216  p071199  ...  0.699446  0.300554

[2 rows x 27 columns]
In [0]:
df_train.isna().any()
Out[0]:
Unnamed: 0                                      False
id                                              False
teacher_id                                      False
teacher_prefix                                  False
school_state                                    False
Date                                            False
project_grade_category                          False
project_title                                   False
project_essay_1                                 False
project_essay_2                                 False
project_essay_3                                  True
project_essay_4                                  True
project_resource_summary                        False
teacher_number_of_previously_posted_projects    False
project_is_approved                             False
clean_categories                                False
clean_subcategories                             False
essay                                           False
clean_essay                                     False
clean_title                                     False
price                                           False
quantity                                        False
title_wc                                        False
essay_wc                                        False
senti_score                                     False
cat_1                                            True
cat_0                                            True
dtype: bool
In [0]:
#appending prob values to test data in a new column. Incase the class is not part of the train data, a prob of 0.5 is assigned
for idx,j in clean_cat_count.iterrows():
  for indx,i in df_test.iterrows():
    if idx == df_test.at[indx, 'clean_categories']:
      df_test.at[indx, 'cat_1'] = clean_cat_count.at[idx, '1_prob']
      df_test.at[indx, 'cat_0'] = clean_cat_count.at[idx, '0_prob']
      
df_test['cat_1']=df_test['cat_0'].replace(np.nan,0.5)
df_test['cat_0']=df_test['cat_0'].replace(np.nan,0.5)

print(df_test.head(2))
       Unnamed: 0       id  ...     cat_1     cat_0
43188       98924  p204347  ...  0.377422  0.377422
39762      173403  p117233  ...  0.310239  0.310239

[2 rows x 27 columns]
In [0]:
df_train['cat_1']=df_train['cat_1'].replace(np.nan,0.5)
df_train['cat_0']=df_train['cat_0'].replace(np.nan,0.5)

2.2.1.2 Feature encoding for subcategories

In [0]:
#https://stackoverflow.com/questions/3839729/count-unique-values-with-pandas-per-groups

# Fetching unique value counts for each class
clean_subcat_count = pd.DataFrame()
clean_subcat_count[1] = df_train['clean_subcategories'].where(df_train['project_is_approved']==1).value_counts()
clean_subcat_count[0] = df_train['clean_subcategories'].where(df_train['project_is_approved']==0).value_counts()

#Replacing nan value counts with zeros
clean_subcat_count[1]=clean_subcat_count[1].replace(np.nan,0)
clean_subcat_count[0]=clean_subcat_count[0].replace(np.nan,0)

#print(clean_subcat_count)

#Calculating probs for each class
for i in clean_subcat_count.iterrows():
  clean_subcat_count['1_prob'] = clean_subcat_count[1]/(clean_subcat_count[1]+clean_subcat_count[0])
  clean_subcat_count['0_prob'] = clean_subcat_count[0]/(clean_subcat_count[1]+clean_subcat_count[0])
  
#print(clean_subcat_count)

#appending prob values to train data in a new column
  
for idx,j in clean_subcat_count.iterrows():
  for indx,i in df_train.iterrows():
    if idx == df_train.at[indx, 'clean_subcategories']:
      df_train.at[indx, 'subcat_1'] = clean_subcat_count.at[idx, '1_prob']
      df_train.at[indx, 'subcat_0'] = clean_subcat_count.at[idx, '0_prob']
      
print(df_test.head(2))
       Unnamed: 0       id  ...     cat_1     cat_0
43188       98924  p204347  ...  0.377422  0.377422
39762      173403  p117233  ...  0.310239  0.310239

[2 rows x 27 columns]
In [0]:
#appending prob values to test data in a new column. Incase the class is not part of the train data, a prob of 0.5 is assigned
for idx,j in clean_subcat_count.iterrows():
  for indx,i in df_test.iterrows():
    if idx == df_test.at[indx, 'clean_subcategories']:
      df_test.at[indx, 'subcat_1'] = clean_subcat_count.at[idx, '1_prob']
      df_test.at[indx, 'subcat_0'] = clean_subcat_count.at[idx, '0_prob']
      
df_test['subcat_1']=df_test['subcat_1'].replace(np.nan,0.5)
df_test['subcat_0']=df_test['subcat_0'].replace(np.nan,0.5)

print(df_test.head(2))
       Unnamed: 0       id  ...  subcat_1  subcat_0
43188       98924  p204347  ...  0.651832  0.348168
39762      173403  p117233  ...  0.636752  0.363248

[2 rows x 29 columns]
In [0]:
df_train['subcat_1']=df_train['subcat_1'].replace(np.nan,0.5)
df_train['subcat_0']=df_train['subcat_0'].replace(np.nan,0.5)

2.2.1.3 Feature encoding for state

In [0]:
#https://stackoverflow.com/questions/3839729/count-unique-values-with-pandas-per-groups

# Fetching unique value counts for each class
state_count = pd.DataFrame()
state_count[1] = df_train['school_state'].where(df_train['project_is_approved']==1).value_counts()
state_count[0] = df_train['school_state'].where(df_train['project_is_approved']==0).value_counts()

#Replacing nan value counts with zeros
state_count[1]=state_count[1].replace(np.nan,0)
state_count[0]=state_count[0].replace(np.nan,0)

#print(state_count)

#Calculating probs for each class
for i in state_count.iterrows():
  state_count['1_prob'] = state_count[1]/(state_count[1]+state_count[0])
  state_count['0_prob'] = state_count[0]/(state_count[1]+state_count[0])
  
#print(state_count)

#appending prob values to train data in a new column
  
for idx,j in state_count.iterrows():
  for indx,i in df_train.iterrows():
    if idx == df_train.at[indx, 'school_state']:
      df_train.at[indx, 'state_1'] = state_count.at[idx, '1_prob']
      df_train.at[indx, 'state_0'] = state_count.at[idx, '0_prob']
      
print(df_test.head(2))
       Unnamed: 0       id  ...  subcat_1  subcat_0
43188       98924  p204347  ...  0.651832  0.348168
39762      173403  p117233  ...  0.636752  0.363248

[2 rows x 29 columns]
In [0]:
#appending prob values to test data in a new column. Incase the class is not part of the train data, a prob of 0.5 is assigned
for idx,j in state_count.iterrows():
  for indx,i in df_test.iterrows():
    if idx == df_test.at[indx, 'school_state']:
      df_test.at[indx, 'state_1'] = state_count.at[idx, '1_prob']
      df_test.at[indx, 'state_0'] = state_count.at[idx, '0_prob']
      
df_test['state_1']=df_test['state_1'].replace(np.nan,0.5)
df_test['state_0']=df_test['state_0'].replace(np.nan,0.5)

print(df_test.head(2))
       Unnamed: 0       id  ...   state_1   state_0
43188       98924  p204347  ...  0.674091  0.325909
39762      173403  p117233  ...  0.729908  0.270092

[2 rows x 31 columns]

2.2.1.4 Feature encoding for teacher_prefix

In [0]:
#https://stackoverflow.com/questions/3839729/count-unique-values-with-pandas-per-groups

# Fetching unique value counts for each class
teacherprefix_count = pd.DataFrame()
teacherprefix_count[1] = df_train['teacher_prefix'].where(df_train['project_is_approved']==1).value_counts()
teacherprefix_count[0] = df_train['teacher_prefix'].where(df_train['project_is_approved']==0).value_counts()

#Replacing nan value counts with zeros
teacherprefix_count[1]=teacherprefix_count[1].replace(np.nan,0)
teacherprefix_count[0]=teacherprefix_count[0].replace(np.nan,0)

#print(teacherprefix_count)

#Calculating probs for each class
for i in teacherprefix_count.iterrows():
  teacherprefix_count['1_prob'] = teacherprefix_count[1]/(teacherprefix_count[1]+teacherprefix_count[0])
  teacherprefix_count['0_prob'] = teacherprefix_count[0]/(teacherprefix_count[1]+teacherprefix_count[0])
  
#print(teacherprefix_count)

#appending prob values to train data in a new column
  
for idx,j in teacherprefix_count.iterrows():
  for indx,i in df_train.iterrows():
    if idx == df_train.at[indx, 'teacher_prefix']:
      df_train.at[indx, 'teacherprefix_1'] = teacherprefix_count.at[idx, '1_prob']
      df_train.at[indx, 'teacherprefix_0'] = teacherprefix_count.at[idx, '0_prob']
In [0]:
print(df_train['teacherprefix_0'].head(2))
47021    0.337500
48842    0.322658
Name: teacherprefix_0, dtype: float64
In [0]:
#appending prob values to test data in a new column. Incase the class is not part of the train data, a prob of 0.5 is assigned
for idx,j in teacherprefix_count.iterrows():
  for indx,i in df_test.iterrows():
    if idx == df_test.at[indx, 'teacher_prefix']:
      df_test.at[indx, 'teacherprefix_1'] = teacherprefix_count.at[idx, '1_prob']
      df_test.at[indx, 'teacherprefix_0'] = teacherprefix_count.at[idx, '0_prob']
      
df_test['teacherprefix_1']=df_test['teacherprefix_1'].replace(np.nan,0.5)
df_test['teacherprefix_0']=df_test['teacherprefix_0'].replace(np.nan,0.5)

print(df_test['teacherprefix_0'].head(2))
43188    0.3348
39762    0.3348
Name: teacherprefix_0, dtype: float64
In [0]:
df_train['teacherprefix_1']=df_train['teacherprefix_1'].replace(np.nan,0.5)
df_train['teacherprefix_0']=df_train['teacherprefix_0'].replace(np.nan,0.5)

2.2.1.5 Feature encoding for project_grade_category

In [0]:
#https://stackoverflow.com/questions/3839729/count-unique-values-with-pandas-per-groups

# Fetching unique value counts for each class
project_grade_category_count = pd.DataFrame()
project_grade_category_count[1] = df_train['project_grade_category'].where(df_train['project_is_approved']==1).value_counts()
project_grade_category_count[0] = df_train['project_grade_category'].where(df_train['project_is_approved']==0).value_counts()

#Replacing nan value counts with zeros
project_grade_category_count[1]=project_grade_category_count[1].replace(np.nan,0)
project_grade_category_count[0]=project_grade_category_count[0].replace(np.nan,0)

#print(project_grade_category_count)

#Calculating probs for each class
for i in project_grade_category_count.iterrows():
  project_grade_category_count['1_prob'] = project_grade_category_count[1]/(project_grade_category_count[1]+project_grade_category_count[0])
  project_grade_category_count['0_prob'] = project_grade_category_count[0]/(project_grade_category_count[1]+project_grade_category_count[0])
  
#print(project_grade_category_count)

#appending prob values to train data in a new column
  
for idx,j in project_grade_category_count.iterrows():
  for indx,i in df_train.iterrows():
    if idx == df_train.at[indx, 'project_grade_category']:
      df_train.at[indx, 'project_grade_category_1'] = project_grade_category_count.at[idx, '1_prob']
      df_train.at[indx, 'project_grade_category_0'] = project_grade_category_count.at[idx, '0_prob']
      
print(df_train.head(2))
       Unnamed: 0       id  ... project_grade_category_1 project_grade_category_0
47021       44946  p007627  ...                 0.662707                 0.337293
48842       26216  p071199  ...                 0.662707                 0.337293

[2 rows x 35 columns]
In [0]:
#appending prob values to test data in a new column. Incase the class is not part of the train data, a prob of 0.5 is assigned
for idx,j in project_grade_category_count.iterrows():
  for indx,i in df_test.iterrows():
    if idx == df_test.at[indx, 'project_grade_category']:
      df_test.at[indx, 'project_grade_category_1'] = project_grade_category_count.at[idx, '1_prob']
      df_test.at[indx, 'project_grade_category_0'] = project_grade_category_count.at[idx, '0_prob']
      
df_test['project_grade_category_1']=df_test['project_grade_category_1'].replace(np.nan,0.5)
df_test['project_grade_category_0']=df_test['project_grade_category_0'].replace(np.nan,0.5)

print(df_test.head(2))
       Unnamed: 0       id  ... project_grade_category_1 project_grade_category_0
43188       98924  p204347  ...                 0.677308                 0.322692
39762      173403  p117233  ...                 0.677308                 0.322692

[2 rows x 35 columns]
In [0]:
print(len(df_train.columns), len(df_test.columns))
35 35

2.2.2 Vectorizing Numerical features

2.2.2.1 Vectorizing price

In [0]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler

# Reshape your data either using array.reshape(-1, 1)
print(df_train.columns)
price_scalar = StandardScaler()
price_scalar.fit(df_train['price'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")

# Now standardize the data with above maen and variance.
price_train_standardized = price_scalar.transform(df_train['price'].values.reshape(-1, 1))
price_test_standardized = price_scalar.transform(df_test['price'].values.reshape(-1, 1))
Index(['Unnamed: 0', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'Date', 'project_grade_category', 'project_title', 'project_essay_1',
       'project_essay_2', 'project_essay_3', 'project_essay_4',
       'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'clean_categories', 'clean_subcategories', 'essay', 'clean_essay',
       'clean_title', 'price', 'quantity', 'title_wc', 'essay_wc',
       'senti_score', 'cat_1', 'cat_0', 'subcat_1', 'subcat_0', 'state_1',
       'state_0', 'teacherprefix_1', 'teacherprefix_0',
       'project_grade_category_1', 'project_grade_category_0'],
      dtype='object')
Mean : 311.6786477142857, Standard deviation : 369.7872562957825

2.2.2.2 Vectorizing no. of previously posted projects

In [0]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

prev_proj_scalar = StandardScaler()
prev_proj_scalar.fit(df_train['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {prev_proj_scalar.mean_[0]}, Standard deviation : {np.sqrt(prev_proj_scalar.var_[0])}")

# Now standardize the data with above mean and variance.
prev_proj_train_standardized = prev_proj_scalar.transform(df_train['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
prev_proj_test_standardized = prev_proj_scalar.transform(df_test['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
Mean : 10.380171428571428, Standard deviation : 26.468930270883593

2.2.2.3 Vectorizing word counts of project title

In [0]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

wc_title_scalar = StandardScaler()
wc_title_scalar.fit(df_train['title_wc'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {wc_title_scalar.mean_[0]}, Standard deviation : {np.sqrt(wc_title_scalar.var_[0])}")

# Now standardize the data with above mean and variance.
wc_title_train_standardized = wc_title_scalar.transform(df_train['title_wc'].values.reshape(-1, 1))
wc_title_test_standardized = wc_title_scalar.transform(df_test['title_wc'].values.reshape(-1, 1))
Mean : 3.6698857142857144, Standard deviation : 1.5460166284714418

2.2.2.4 Vectorizing word counts of essay text

In [0]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

wc_essay_scalar = StandardScaler()
wc_essay_scalar.fit(df_train['essay_wc'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {wc_essay_scalar.mean_[0]}, Standard deviation : {np.sqrt(wc_essay_scalar.var_[0])}")

# Now standardize the data with above mean and variance.
wc_essay_train_standardized = wc_essay_scalar.transform(df_train['essay_wc'].values.reshape(-1, 1))
wc_essay_test_standardized = wc_essay_scalar.transform(df_test['essay_wc'].values.reshape(-1, 1))
Mean : 136.6520857142857, Standard deviation : 35.60580227504776

2.2.2.5 Vectorizing sentimental scores of project essays

In [0]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

senti_score_scalar = StandardScaler()
senti_score_scalar.fit(df_train['senti_score'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {senti_score_scalar.mean_[0]}, Standard deviation : {np.sqrt(senti_score_scalar.var_[0])}")

# Now standardize the data with above mean and variance.
senti_score_train_standardized = senti_score_scalar.transform(df_train['senti_score'].values.reshape(-1, 1))
senti_score_test_standardized = senti_score_scalar.transform(df_test['senti_score'].values.reshape(-1, 1))
Mean : 0.9589750199999999, Standard deviation : 0.15145545513638994

2.2.2.6 Vectorizing Quantity

In [0]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

qty_scalar = StandardScaler()
qty_scalar.fit(df_train['quantity'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {qty_scalar.mean_[0]}, Standard deviation : {np.sqrt(qty_scalar.var_[0])}")

# Now standardize the data with above mean and variance.
qty_train_standardized = qty_scalar.transform(df_train['quantity'].values.reshape(-1, 1))
qty_test_standardized = qty_scalar.transform(df_test['quantity'].values.reshape(-1, 1))
Mean : 17.658885714285713, Standard deviation : 26.903832141559764

2.3 Make Data Model Ready: encoding eassay, and project_title</h2>

2.3.1 Vectorizing Text data

2.3.1.1 Bag of words for essay text

In [0]:
# We are considering only the words which appeared in at least 10 documents(rows or projects).
vectorizer = CountVectorizer(min_df=10)
text_train_bow = vectorizer.fit_transform(df_train['clean_essay'])
text_test_bow = vectorizer.transform(df_test['clean_essay'])
print("Shape of matrix after one hot encoding ",text_train_bow.shape, text_test_bow.shape)
Shape of matrix after one hot encoding  (35000, 10447) (15000, 10447)
In [0]:
# you can vectorize the title also 
# before you vectorize the title make sure you preprocess it

vectorizer = CountVectorizer(min_df=10)
title_train_bow = vectorizer.fit_transform(df_train['clean_title'])
title_test_bow = vectorizer.transform(df_test['clean_title'])
print("Shape of matrix after one hot encoding ", title_train_bow.shape, title_test_bow.shape)
Shape of matrix after one hot encoding  (35000, 1559) (15000, 1559)

2.3.1.2 TFIDF vectorizer for essay text

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10)

text_train_tfidf = vectorizer.fit_transform(df_train['clean_essay'])
text_test_tfidf = vectorizer.transform(df_test['clean_essay'])
print("Shape of matrix after one hot encoding ",text_train_tfidf.shape, text_test_tfidf.shape)
Shape of matrix after one hot encoding  (35000, 10447) (15000, 10447)
In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10)

title_train_tfidf = vectorizer.fit_transform(df_train['clean_title'])
title_test_tfidf = vectorizer.transform(df_test['clean_title'])

print("Shape of matrix after one hot encodig ",title_train_tfidf.shape, title_test_tfidf.shape)
Shape of matrix after one hot encodig  (35000, 1559) (15000, 1559)

2.3.1.3 Using Pretrained models: Avg W2V vectorizer

In [0]:
'''def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
    f = open(gloveFile,'r', encoding="utf8")
    model = {}
    for line in tqdm(f):
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model
model = loadGloveModel('drive/My Drive/Colab Notebooks/glove.6B.50d.txt')'''
Out[0]:
'def loadGloveModel(gloveFile):\n    print ("Loading Glove Model")\n    f = open(gloveFile,\'r\', encoding="utf8")\n    model = {}\n    for line in tqdm(f):\n        splitLine = line.split()\n        word = splitLine[0]\n        embedding = np.array([float(val) for val in splitLine[1:]])\n        model[word] = embedding\n    print ("Done.",len(model)," words loaded!")\n    return model\nmodel = loadGloveModel(\'drive/My Drive/Colab Notebooks/glove.6B.50d.txt\')'
In [0]:
'''words = []
for i in preprocessed_essays:
    words.extend(i.split(' '))

for i in preprocessed_title:
    words.extend(i.split(' '))
print("all the words in the coupus", len(words))
words = set(words)
print("the unique words in the coupus", len(words))

inter_words = set(model.keys()).intersection(words)
print("The number of words that are present in both glove vectors and our coupus", \
      len(inter_words),"(",np.round(len(inter_words)/len(words)*100,3),"%)")

words_courpus = {}
words_glove = set(model.keys())
for i in words:
    if i in words_glove:
        words_courpus[i] = model[i]
print("word 2 vec length", len(words_courpus))


# stronging variables into pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/

import pickle
with open('drive/My Drive/Colab Notebooks/glove_vectors_50', 'wb') as f:
    pickle.dump(words_courpus, f)'''
Out[0]:
'words = []\nfor i in preprocessed_essays:\n    words.extend(i.split(\' \'))\n\nfor i in preprocessed_title:\n    words.extend(i.split(\' \'))\nprint("all the words in the coupus", len(words))\nwords = set(words)\nprint("the unique words in the coupus", len(words))\n\ninter_words = set(model.keys()).intersection(words)\nprint("The number of words that are present in both glove vectors and our coupus",       len(inter_words),"(",np.round(len(inter_words)/len(words)*100,3),"%)")\n\nwords_courpus = {}\nwords_glove = set(model.keys())\nfor i in words:\n    if i in words_glove:\n        words_courpus[i] = model[i]\nprint("word 2 vec length", len(words_courpus))\n\n\n# stronging variables into pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/\n\nimport pickle\nwith open(\'drive/My Drive/Colab Notebooks/glove_vectors_50\', \'wb\') as f:\n    pickle.dump(words_courpus, f)'
In [0]:
# storing variables into pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# make sure you have the glove_vectors file

with open('drive/My Drive/Colab Notebooks/glove_vectors_50', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())
In [0]:
# average Word2Vec
# compute average word2vec for each review.

avg_w2v_train_text_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(df_train['clean_essay']): # for each review/sentence
    vector = np.zeros(50) # as word vectors are of zero length. 50 is the size of each vector in glove file
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_train_text_vectors.append(vector)

print(len(avg_w2v_train_text_vectors))
print(len(avg_w2v_train_text_vectors[0]))
100%|██████████| 35000/35000 [00:08<00:00, 4054.14it/s]
35000
50

In [0]:
# average Word2Vec
# compute average word2vec for each review.

avg_w2v_test_text_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(df_test['clean_essay']): # for each review/sentence
    vector = np.zeros(50) # as word vectors are of zero length. 50 is the size of each vector in glove file
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_test_text_vectors.append(vector)

print(len(avg_w2v_test_text_vectors))
print(len(avg_w2v_test_text_vectors[0]))
100%|██████████| 15000/15000 [00:03<00:00, 4064.47it/s]
15000
50

In [0]:
# Similarly you can vectorize for title also

# average Word2Vec
# compute average word2vec for each title
avg_w2v_title_train_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(df_train['clean_title']): # for each review/sentence
    vector = np.zeros(50) # as word vectors are of zero length. 50 is the size of each vector in glove file
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_title_train_vectors.append(vector)

print(len(avg_w2v_title_train_vectors))
print(len(avg_w2v_title_train_vectors[0]))
100%|██████████| 35000/35000 [00:00<00:00, 78082.66it/s]
35000
50

In [0]:
# Similarly you can vectorize for title also

# average Word2Vec
# compute average word2vec for each title
avg_w2v_title_test_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(df_test['clean_title']): # for each review/sentence
    vector = np.zeros(50) # as word vectors are of zero length. 50 is the size of each vector in glove file
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_title_test_vectors.append(vector)

print(len(avg_w2v_title_test_vectors))
print(len(avg_w2v_title_test_vectors[0]))
100%|██████████| 15000/15000 [00:00<00:00, 76058.45it/s]
15000
50

2.3.1.4 Using Pretrained Models: TFIDF weighted W2V for essay text

In [0]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
tfidf_model = TfidfVectorizer()
tfidf_model.fit_transform(df_train['clean_essay'])
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())
In [0]:
# average Word2Vec
# compute average word2vec for each review.
tfidf_w2v_train_text_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(df_train['clean_essay']): # for each review/sentence
    vector = np.zeros(50) # as word vectors are of zero length. 50 is the size of each vector in glove file
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_train_text_vectors.append(vector)

print(len(tfidf_w2v_train_text_vectors))
print(len(tfidf_w2v_train_text_vectors[0]))
100%|██████████| 35000/35000 [01:02<00:00, 557.10it/s]
35000
50

In [0]:
# average Word2Vec
# compute average word2vec for each review.
tfidf_w2v_test_text_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(df_test['clean_essay']): # for each review/sentence
    vector = np.zeros(50) # as word vectors are of zero length. 50 is the size of each vector in glove file
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_test_text_vectors.append(vector)

print(len(tfidf_w2v_test_text_vectors))
print(len(tfidf_w2v_test_text_vectors[0]))
100%|██████████| 15000/15000 [00:26<00:00, 562.31it/s]
15000
50

2.3.1.4 Using Pretrained Models: TFIDF weighted W2V for title

In [0]:
# Similarly you can vectorize for title also

tfidf_model = TfidfVectorizer()
tfidf_model.fit_transform(df_train['clean_title'])
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())
In [0]:
# average Word2Vec
# compute average word2vec for each project title.
tfidf_w2v_train_title_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(df_train['clean_title']): # for each review/sentence
    vector = np.zeros(50) # as word vectors are of zero length. 50 is the size of each vector in glove file
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_train_title_vectors.append(vector)

print(len(tfidf_w2v_train_title_vectors))
print(len(tfidf_w2v_train_title_vectors[0]))
100%|██████████| 35000/35000 [00:00<00:00, 40101.08it/s]
35000
50

In [0]:
# average Word2Vec
# compute average word2vec for each project title.
tfidf_w2v_test_title_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(df_test['clean_title']): # for each review/sentence
    vector = np.zeros(50) # as word vectors are of zero length. 50 is the size of each vector in glove file
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_test_title_vectors.append(vector)

print(len(tfidf_w2v_test_title_vectors))
print(len(tfidf_w2v_test_title_vectors[0]))
100%|██████████| 15000/15000 [00:00<00:00, 41235.28it/s]
15000
50

2.4 Applying Decision Tree Classifier on different kinds of featurizations as mentioned in the instructions

2.4.1 Applying Decision Tree Classifier on BOW featurization, SET 1

Hyper paramter tuning method: GridSearch

In [0]:
#https://www.digitalocean.com/community/tutorials/how-to-plot-data-in-python-3-using-matplotlib
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
#https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    
from scipy.sparse import hstack
from sklearn.model_selection import GridSearchCV

import matplotlib.patches as mpatches
from sklearn.metrics import roc_auc_score

x_train = hstack((df_train['cat_1'].values.reshape(-1,1), df_train['cat_0'].values.reshape(-1,1), df_train['subcat_1'].values.reshape(-1,1),
                  df_train['subcat_0'].values.reshape(-1,1), df_train['state_1'].values.reshape(-1,1), df_train['state_0'].values.reshape(-1,1),
                  df_train['teacherprefix_1'].values.reshape(-1,1), df_train['teacherprefix_0'].values.reshape(-1,1),
                  df_train['project_grade_category_1'].values.reshape(-1,1), df_train['project_grade_category_0'].values.reshape(-1,1),
                  price_train_standardized, prev_proj_train_standardized, wc_title_train_standardized, wc_essay_train_standardized,
                  senti_score_train_standardized, qty_train_standardized, text_train_bow, title_train_bow))
y_train = df_train['project_is_approved']

x_test = hstack((df_test['cat_1'].values.reshape(-1,1), df_test['cat_0'].values.reshape(-1,1), df_test['subcat_1'].values.reshape(-1,1),
                  df_test['subcat_0'].values.reshape(-1,1), df_test['state_1'].values.reshape(-1,1), df_test['state_0'].values.reshape(-1,1),
                  df_test['teacherprefix_1'].values.reshape(-1,1), df_test['teacherprefix_0'].values.reshape(-1,1),
                  df_test['project_grade_category_1'].values.reshape(-1,1), df_test['project_grade_category_0'].values.reshape(-1,1), price_test_standardized,
                  prev_proj_test_standardized, wc_title_test_standardized, wc_essay_test_standardized, senti_score_test_standardized, 
                 qty_test_standardized, text_test_bow, title_test_bow))
y_test = df_test['project_is_approved']

print(x_train.shape, type(x_train), y_train.shape, type(y_train))
print(x_test.shape, type(x_test), y_test.shape, type(y_test))
(35000, 12022) <class 'scipy.sparse.coo.coo_matrix'> (35000,) <class 'pandas.core.series.Series'>
(15000, 12022) <class 'scipy.sparse.coo.coo_matrix'> (15000,) <class 'pandas.core.series.Series'>
In [0]:
#https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.ensemble import RandomForestClassifier

#Initialising Classifier
classifier = RandomForestClassifier(class_weight='balanced')

#Brute force approach for finding best K value
parameters = {'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
              'n_estimators': [5, 10, 50, 100, 200, 500, 1000]}

#Training the model on train data
RF_BoW = GridSearchCV(classifier, parameters, cv=3, return_train_score=True, scoring='roc_auc', n_jobs=-1)
RF_BoW.fit(x_train, y_train)
Out[0]:
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True,
                                              class_weight='balanced',
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'n_estimators': [5, 10, 50, 100, 200, 500, 1000]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=0)
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://qiita.com/bmj0114/items/8009f282c99b77780563

print(RF_BoW.best_params_) #Gives the best value of parameters from the given range

train_scores = RF_BoW.cv_results_['mean_train_score'].reshape(len(parameters['max_depth']),len(parameters['n_estimators']))
test_scores = RF_BoW.cv_results_['mean_test_score'].reshape(len(parameters['max_depth']),len(parameters['n_estimators']))

df_tr=pd.DataFrame(train_scores)
df_tr.index=parameters['max_depth']
df_tr.columns=parameters['n_estimators']

df_te=pd.DataFrame(test_scores)
df_te.index=parameters['max_depth']
df_te.columns=parameters['n_estimators']

plt.subplots(figsize=(24,4))
plt.subplot(1,2,1)
sns.heatmap(df_tr, annot=True,annot_kws={"size": 10}, fmt='g')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('AUC plot for Train data')
plt.subplots_adjust(wspace=0.5)

plt.subplot(1,2,2)
sns.heatmap(df_te, annot=True,annot_kws={"size": 10}, fmt='g')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('AUC plot for Test data')
plt.subplots_adjust(wspace=0.5)
plt.show()

plt.close()
{'max_depth': 10, 'n_estimators': 1000}
In [0]:
#https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
#https://stackoverflow.com/questions/34894587/should-we-plot-the-roc-curve-for-each-class

from sklearn.metrics import roc_curve, auc

#training the model on the best K value found in the above result 
final_RF_BoW = RandomForestClassifier(max_depth=10, n_estimators=1000, class_weight='balanced')
final_RF_BoW.fit(x_train,y_train)
Out[0]:
RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=1000, n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False)
In [0]:
x_train_csr=x_train.tocsr()
x_test_csr=x_test.tocsr()

y_train_pred=[]
y_test_pred=[]

#ROC curve function takes the actual values and the predicted probabilities of the positive class
for i in range(0,x_train.shape[0]):
    y_train_pred.extend(final_RF_BoW.predict_proba(x_train_csr[i])[:,1]) #[:,1] gives the probability for class 1

for i in range(0,x_test.shape[0]):
    y_test_pred.extend(final_RF_BoW.predict_proba(x_test_csr[i])[:,1])
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
#https://www.programcreek.com/python/example/81207/sklearn.metrics.roc_curve
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
#https://stats.stackexchange.com/questions/105501/understanding-roc-curve

#Calculating FPR and TPR for train and test data
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_pred)

#Calculating AUC for train and test curves
roc_auc_train=auc(train_fpr,train_tpr)
roc_auc_test=auc(test_fpr,test_tpr)

plt.plot(train_fpr, train_tpr, label="Train ROC Curve (area=%0.3f)" % roc_auc_train)
plt.plot(test_fpr, test_tpr, label="Test ROC Curve (area=%0.3f)" % roc_auc_test)
plt.plot([0,1],[0,1],linestyle='--')
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve for BOW")
plt.grid()

plt.show()
plt.close()
In [0]:
np.median(train_thresholds)
Out[0]:
0.49368122556616933
In [0]:
#https://medium.com/hugo-ferreiras-blog/confusion-matrix-and-other-metrics-in-machine-learning-894688cb1c0a
#http://mlwiki.org/index.php/ROC_Analysis

'''
from sklearn.metrics import precision_recall_curve 

precision, recall, thresholds = precision_recall_curve(y_train, y_train_pred)

# create plot
plt.plot(precision, recall, label='Precision-recall curve')
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.title('Precision-recall curve')
plt.legend(loc="lower left")'''
In [0]:
#https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
#https://datatofish.com/confusion-matrix-python/

from sklearn.metrics import confusion_matrix as cf_mx

predicted_train=[]

expected_train = y_train.values
for i in range(0,x_train.shape[0]):
    predicted_train.extend((final_RF_BoW.predict_proba(x_train_csr[i])[:,1]>= 0.4937).astype(bool))

predicted_test=[]

expected_test = y_test.values
for i in range(0,x_test.shape[0]):
    predicted_test.extend((final_RF_BoW.predict_proba(x_test_csr[i])[:,1]>= 0.4937).astype(bool))
In [0]:
plt.subplots(figsize=(15,4))
plt.subplot(1,2,1)
cmdf_train=cf_mx(expected_train, predicted_train)
df_cm_train = pd.DataFrame(cmdf_train, range(2),range(2))
df_cm_train.columns = ['Predicted: NO','Predicted: YES']
df_cm_train = df_cm_train.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for train data using BoW ')

plt.subplot(1,2,2)
cmdf_test=cf_mx(expected_test, predicted_test)
df_cm_test = pd.DataFrame(cmdf_test, range(2),range(2))
df_cm_test.columns = ['Predicted: NO','Predicted: YES']
df_cm_test = df_cm_test.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for test data using BoW ')
plt.subplots_adjust(wspace=0.5)
plt.show()
plt.close()

2.4.2 Applying GBDT Classifier brute force on TFIDF, SET 1 (GridSearch)

Hyper paramter tuning method: GridSearch

In [0]:
#https://www.digitalocean.com/community/tutorials/how-to-plot-data-in-python-3-using-matplotlib
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
#https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    
from scipy.sparse import hstack
from sklearn.model_selection import GridSearchCV

import matplotlib.patches as mpatches
from sklearn.metrics import roc_auc_score

x_train = hstack((df_train['cat_1'].values.reshape(-1,1), df_train['cat_0'].values.reshape(-1,1), df_train['subcat_1'].values.reshape(-1,1),
                  df_train['subcat_0'].values.reshape(-1,1), df_train['state_1'].values.reshape(-1,1), df_train['state_0'].values.reshape(-1,1),
                  df_train['teacherprefix_1'].values.reshape(-1,1), df_train['teacherprefix_0'].values.reshape(-1,1),
                  df_train['project_grade_category_1'].values.reshape(-1,1), df_train['project_grade_category_0'].values.reshape(-1,1),
                  price_train_standardized, prev_proj_train_standardized, wc_title_train_standardized, wc_essay_train_standardized,
                  senti_score_train_standardized, qty_train_standardized, text_train_bow, title_train_bow))
y_train = df_train['project_is_approved']

x_test = hstack((df_test['cat_1'].values.reshape(-1,1), df_test['cat_0'].values.reshape(-1,1), df_test['subcat_1'].values.reshape(-1,1),
                  df_test['subcat_0'].values.reshape(-1,1), df_test['state_1'].values.reshape(-1,1), df_test['state_0'].values.reshape(-1,1),
                  df_test['teacherprefix_1'].values.reshape(-1,1), df_test['teacherprefix_0'].values.reshape(-1,1),
                 df_test['project_grade_category_1'].values.reshape(-1,1), df_test['project_grade_category_0'].values.reshape(-1,1), price_test_standardized,
                  prev_proj_test_standardized, wc_title_test_standardized, wc_essay_test_standardized, senti_score_test_standardized, 
                 qty_test_standardized, text_test_bow, title_test_bow))
y_test = df_test['project_is_approved']

print(x_train.shape, type(x_train), y_train.shape, type(y_train))
print(x_test.shape, type(x_test), y_test.shape, type(y_test))
(35000, 12022) <class 'scipy.sparse.coo.coo_matrix'> (35000,) <class 'pandas.core.series.Series'>
(15000, 12022) <class 'scipy.sparse.coo.coo_matrix'> (15000,) <class 'pandas.core.series.Series'>
In [0]:
#https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.ensemble import GradientBoostingClassifier

#Initialising Classifier
classifier = GradientBoostingClassifier()

#Brute force approach for finding best K value
parameters = {'n_estimators': [5, 10, 50, 100, 200, 500]}

#Training the model on train data
GBDT_BoW = GridSearchCV(classifier, parameters, cv=3, return_train_score=True, scoring='roc_auc', n_jobs=-1)
GBDT_BoW.fit(x_train, y_train)
Out[0]:
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=GradientBoostingClassifier(criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
                                                  presort='auto',
                                                  random_state=None,
                                                  subsample=1.0, tol=0.0001,
                                                  validation_fraction=0.1,
                                                  verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'n_estimators': [5, 10, 50, 100, 200, 500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=0)
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://qiita.com/bmj0114/items/8009f282c99b77780563

print(GBDT_BoW.best_params_) #Gives the best value of parameters from the given range

print(GBDT_BoW.cv_results_['mean_train_score'])
print(GBDT_BoW.cv_results_['mean_test_score'])
print(parameters['n_estimators'])

plt.figure(figsize=(10,3))
plt.plot(parameters['n_estimators'],GBDT_BoW.cv_results_['mean_train_score'], label="Train")
plt.plot(parameters['n_estimators'],GBDT_BoW.cv_results_['mean_test_score'], label="Test")
plt.title('AUC plot for train and test datasets')
plt.xlabel('n_estimator values')
plt.ylabel('Area under ROC Curve')
plt.legend()
plt.grid()
plt.show()
plt.close()

plt.close()
{'n_estimators': 500}
[0.6848819  0.7029891  0.7521622  0.77818818 0.81133392 0.86700529]
[0.67088538 0.68784521 0.72522508 0.73643249 0.74355008 0.74698039]
[5, 10, 50, 100, 200, 500]
In [0]:
#https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
#https://stackoverflow.com/questions/34894587/should-we-plot-the-roc-curve-for-each-class

from sklearn.metrics import roc_curve, auc

#training the model on the best K value found in the above result 
final_GBDT_BoW = GradientBoostingClassifier(n_estimators=500)
final_GBDT_BoW.fit(x_train,y_train)

x_train_csr=x_train.tocsr()
x_test_csr=x_test.tocsr()

y_train_pred=[]
y_test_pred=[]

#ROC curve function takes the actual values and the predicted probabilities of the positive class
for i in range(0,x_train.shape[0]):
    y_train_pred.extend(final_GBDT_BoW.predict_proba(x_train_csr[i])[:,1]) #[:,1] gives the probability for class 1

for i in range(0,x_test.shape[0]):
    y_test_pred.extend(final_GBDT_BoW.predict_proba(x_test_csr[i])[:,1])
In [0]:
import dill
#dill.dump_session('drive/My Drive/Colab Notebooks/sess_GBDT.pckl')
#dill.load_session('drive/My Drive/Colab Notebooks/sess_GBDT.pckl')
/usr/local/lib/python3.6/dist-packages/nltk/twitter/__init__.py:20: UserWarning:

The twython library has not been installed. Some functionality from the twitter package will not be available.

In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
#https://www.programcreek.com/python/example/81207/sklearn.metrics.roc_curve
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

#Calculating FPR and TPR for train and test data
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_pred)

#Calculating AUC for train and test curves
roc_auc_train=auc(train_fpr,train_tpr)
roc_auc_test=auc(test_fpr,test_tpr)

plt.plot(train_fpr, train_tpr, label="Train ROC Curve (area=%0.3f)" % roc_auc_train)
plt.plot(test_fpr, test_tpr, label="Test ROC Curve (area=%0.3f)" % roc_auc_test)
plt.plot([0,1],[0,1],linestyle='--')
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve for BOW")
plt.grid()

plt.show()
plt.close()
In [0]:
np.median(train_thresholds)
Out[0]:
0.6206635462915255
In [0]:
#https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
#https://datatofish.com/confusion-matrix-python/

from sklearn.metrics import confusion_matrix as cf_mx

predicted_train=[]

expected_train = y_train.values
for i in range(0,x_train.shape[0]):
    predicted_train.extend((final_GBDT_BoW.predict_proba(x_train_csr[i])[:,1]>= 0.6207).astype(bool))

predicted_test=[]

expected_test = y_test.values
for i in range(0,x_test.shape[0]):
    predicted_test.extend((final_GBDT_BoW.predict_proba(x_test_csr[i])[:,1]>= 0.6207).astype(bool))
In [0]:
plt.subplots(figsize=(15,4))
plt.subplot(1,2,1)
cmdf_train=cf_mx(expected_train, predicted_train)
df_cm_train = pd.DataFrame(cmdf_train, range(2),range(2))
df_cm_train.columns = ['Predicted: NO','Predicted: YES']
df_cm_train = df_cm_train.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for train data using BoW ')

plt.subplot(1,2,2)
cmdf_test=cf_mx(expected_test, predicted_test)
df_cm_test = pd.DataFrame(cmdf_test, range(2),range(2))
df_cm_test.columns = ['Predicted: NO','Predicted: YES']
df_cm_test = df_cm_test.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for test data using BoW ')
plt.subplots_adjust(wspace=0.5)
plt.show()
plt.close()

2.4.3 Applying RF Classifier brute force on TFIDF, SET 2 (GridSearch)

Hyper paramter tuning method: GridSearch

In [0]:
#https://www.digitalocean.com/community/tutorials/how-to-plot-data-in-python-3-using-matplotlib
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
#https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    
from scipy.sparse import hstack
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
import matplotlib.patches as mpatches
from sklearn.metrics import roc_auc_score

x_train_tfidf = hstack((df_train['cat_1'].values.reshape(-1,1), df_train['cat_0'].values.reshape(-1,1), df_train['subcat_1'].values.reshape(-1,1),
                  df_train['subcat_0'].values.reshape(-1,1), df_train['state_1'].values.reshape(-1,1), df_train['state_0'].values.reshape(-1,1),
                  df_train['teacherprefix_1'].values.reshape(-1,1), df_train['teacherprefix_0'].values.reshape(-1,1),
                  df_train['project_grade_category_1'].values.reshape(-1,1), df_train['project_grade_category_0'].values.reshape(-1,1), price_train_standardized,
                  prev_proj_train_standardized, wc_title_train_standardized, wc_essay_train_standardized, senti_score_train_standardized,
                        qty_train_standardized, text_train_tfidf, title_train_tfidf))
y_train_tfidf = df_train['project_is_approved']

x_test_tfidf = hstack((df_test['cat_1'].values.reshape(-1,1), df_test['cat_0'].values.reshape(-1,1), df_test['subcat_1'].values.reshape(-1,1),
                  df_test['subcat_0'].values.reshape(-1,1), df_test['state_1'].values.reshape(-1,1), df_test['state_0'].values.reshape(-1,1),
                  df_test['teacherprefix_1'].values.reshape(-1,1), df_test['teacherprefix_0'].values.reshape(-1,1),
                       df_test['project_grade_category_1'].values.reshape(-1,1), df_test['project_grade_category_0'].values.reshape(-1,1), price_test_standardized,
                  prev_proj_test_standardized, wc_title_test_standardized, wc_essay_test_standardized, senti_score_test_standardized,
                        qty_test_standardized, text_test_tfidf, title_test_tfidf))
y_test_tfidf = df_test['project_is_approved']

print(x_train_tfidf.shape, type(x_train_tfidf), y_train_tfidf.shape, type(y_train_tfidf))
print(x_test_tfidf.shape, type(x_test_tfidf), y_test_tfidf.shape, type(y_test_tfidf))
(35000, 12022) <class 'scipy.sparse.coo.coo_matrix'> (35000,) <class 'pandas.core.series.Series'>
(15000, 12022) <class 'scipy.sparse.coo.coo_matrix'> (15000,) <class 'pandas.core.series.Series'>
In [0]:
#https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

#Initialising Classifier
classifier = RandomForestClassifier(class_weight='balanced')

#Brute force approach for finding best K value
parameters = {'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
              'n_estimators': [5, 10, 50, 100, 200, 500, 1000]}

#Training the model on train data
RF_TFIDF = GridSearchCV(classifier, parameters, cv=3, return_train_score=True, scoring='roc_auc', n_jobs=-1)
RF_TFIDF.fit(x_train_tfidf, y_train_tfidf)
Out[0]:
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True,
                                              class_weight='balanced',
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'n_estimators': [5, 10, 50, 100, 200, 500, 1000]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=0)
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://stackoverflow.com/questions/20944483/python-3-sort-a-dict-by-its-values/20948781

print(RF_TFIDF.best_params_) #Gives the best value of parameters from the given range

train_scores = RF_TFIDF.cv_results_['mean_train_score'].reshape(len(parameters['max_depth']),len(parameters['n_estimators']))
test_scores = RF_TFIDF.cv_results_['mean_test_score'].reshape(len(parameters['max_depth']),len(parameters['n_estimators']))

df_tr=pd.DataFrame(train_scores)
df_tr.index=parameters['max_depth']
df_tr.columns=parameters['n_estimators']

df_te=pd.DataFrame(test_scores)
df_te.index=parameters['max_depth']
df_te.columns=parameters['n_estimators']

plt.subplots(figsize=(24,4))
plt.subplot(1,2,1)
sns.heatmap(df_tr, annot=True,annot_kws={"size": 10}, fmt='g')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('AUC plot for Train data')
plt.subplots_adjust(wspace=0.5)

plt.subplot(1,2,2)
sns.heatmap(df_te, annot=True,annot_kws={"size": 10}, fmt='g')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('AUC plot for Test data')
plt.subplots_adjust(wspace=0.5)
plt.show()

plt.close()
{'max_depth': 10, 'n_estimators': 1000}
In [0]:
#https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
#https://stackoverflow.com/questions/34894587/should-we-plot-the-roc-curve-for-each-class

from sklearn.metrics import roc_curve, auc

#training the model on the best K value found in the above result 
final_RF_tfidf = RandomForestClassifier(max_depth=10, n_estimators=1000, class_weight='balanced')
final_RF_tfidf.fit(x_train_tfidf,y_train_tfidf)

x_train_tfidf_csr=x_train_tfidf.tocsr()
x_test_tfidf_csr=x_test_tfidf.tocsr()

y_train_tfidf_pred=[]
y_test_tfidf_pred=[]

#ROC curve function takes the actual values and the predicted probabilities of the positive class
for i in range(0,x_train_tfidf.shape[0]):
    y_train_tfidf_pred.extend(final_RF_tfidf.predict_proba(x_train_tfidf_csr[i])[:,1]) #[:,1] gives the probability for class 1

for i in range(0,x_test_tfidf.shape[0]):
    y_test_tfidf_pred.extend(final_RF_tfidf.predict_proba(x_test_tfidf_csr[i])[:,1])
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
#https://www.programcreek.com/python/example/81207/sklearn.metrics.roc_curve
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

#Calculating FPR and TPR for train and test data
train_tfidf_fpr, train_tfidf_tpr, train_tfidf_thresholds = roc_curve(y_train_tfidf, y_train_tfidf_pred)
test_tfidf_fpr, test_tfidf_tpr, test_tfidf_thresholds = roc_curve(y_test_tfidf, y_test_tfidf_pred)

#Calculating AUC for train and test curves
roc_auc_tfidf_train=auc(train_tfidf_fpr,train_tfidf_tpr)
roc_auc_tfidf_test=auc(test_tfidf_fpr,test_tfidf_tpr)

plt.plot(train_tfidf_fpr, train_tfidf_tpr, label="Train ROC Curve (area=%0.3f)" % roc_auc_tfidf_train)
plt.plot(test_tfidf_fpr, test_tfidf_tpr, label="Test ROC Curve (area=%0.3f)" % roc_auc_tfidf_test)
plt.plot([0,1],[0,1],linestyle='--')
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve for TFIDF")
plt.grid()
plt.show()
plt.close()
In [0]:
print(np.median(train_tfidf_thresholds))
0.4994960603199601
In [0]:
#https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
#https://datatofish.com/confusion-matrix-python/

from sklearn.metrics import confusion_matrix as cf_mx

predicted_train_tfidf=[]

expected_train_tfidf = y_train_tfidf.values
for i in range(0,x_train_tfidf_csr.shape[0]):
    predicted_train_tfidf.extend((final_RF_tfidf.predict_proba(x_train_tfidf_csr[i])[:,1]>= 0.4995).astype(bool))

predicted_test_tfidf=[]

expected_test_tfidf = y_test_tfidf.values
for i in range(0,x_test_tfidf_csr.shape[0]):
    predicted_test_tfidf.extend((final_RF_tfidf.predict_proba(x_test_tfidf_csr[i])[:,1]>= 0.4995).astype(bool))
In [0]:
plt.subplots(figsize=(15,4))
plt.subplot(1,2,1)
cmdf_train=cf_mx(expected_train_tfidf, predicted_train_tfidf)
df_cm_train = pd.DataFrame(cmdf_train, range(2),range(2))
df_cm_train.columns = ['Predicted: NO','Predicted: YES']
df_cm_train = df_cm_train.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for train data using Avg W2V')

plt.subplot(1,2,2)
cmdf_test=cf_mx(expected_test_tfidf, predicted_test_tfidf)
df_cm_test = pd.DataFrame(cmdf_test, range(2),range(2))
df_cm_test.columns = ['Predicted: NO','Predicted: YES']
df_cm_test = df_cm_test.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for test data using Avg W2V')
plt.subplots_adjust(wspace=0.5)
plt.show()
plt.close()

2.4.4 Applying GBDT Classifier brute force on TFIDF, SET 2 (GridSearch)

Hyper paramter tuning method: GridSearch

In [0]:
#https://www.digitalocean.com/community/tutorials/how-to-plot-data-in-python-3-using-matplotlib
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
#https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    
from scipy.sparse import hstack
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
import matplotlib.patches as mpatches
from sklearn.metrics import roc_auc_score

x_train_tfidf = hstack((df_train['cat_1'].values.reshape(-1,1), df_train['cat_0'].values.reshape(-1,1), df_train['subcat_1'].values.reshape(-1,1),
                  df_train['subcat_0'].values.reshape(-1,1), df_train['state_1'].values.reshape(-1,1), df_train['state_0'].values.reshape(-1,1),
                  df_train['teacherprefix_1'].values.reshape(-1,1), df_train['teacherprefix_0'].values.reshape(-1,1),
                  df_train['project_grade_category_1'].values.reshape(-1,1), df_train['project_grade_category_0'].values.reshape(-1,1), price_train_standardized,
                  prev_proj_train_standardized, wc_title_train_standardized, wc_essay_train_standardized, senti_score_train_standardized,
                        qty_train_standardized, text_train_tfidf, title_train_tfidf))
y_train_tfidf = df_train['project_is_approved']

x_test_tfidf = hstack((df_test['cat_1'].values.reshape(-1,1), df_test['cat_0'].values.reshape(-1,1), df_test['subcat_1'].values.reshape(-1,1),
                  df_test['subcat_0'].values.reshape(-1,1), df_test['state_1'].values.reshape(-1,1), df_test['state_0'].values.reshape(-1,1),
                  df_test['teacherprefix_1'].values.reshape(-1,1), df_test['teacherprefix_0'].values.reshape(-1,1), 
                       df_test['project_grade_category_1'].values.reshape(-1,1), df_test['project_grade_category_0'].values.reshape(-1,1), price_test_standardized,
                  prev_proj_test_standardized, wc_title_test_standardized, wc_essay_test_standardized, senti_score_test_standardized,
                        qty_test_standardized, text_test_tfidf, title_test_tfidf))
y_test_tfidf = df_test['project_is_approved']

print(x_train_tfidf.shape, type(x_train_tfidf), y_train_tfidf.shape, type(y_train_tfidf))
print(x_test_tfidf.shape, type(x_test_tfidf), y_test_tfidf.shape, type(y_test_tfidf))
(35000, 12022) <class 'scipy.sparse.coo.coo_matrix'> (35000,) <class 'pandas.core.series.Series'>
(15000, 12022) <class 'scipy.sparse.coo.coo_matrix'> (15000,) <class 'pandas.core.series.Series'>
In [0]:
#https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

#Initialising Classifier
classifier = GradientBoostingClassifier()

#Brute force approach for finding best K value
parameters = {'n_estimators': [5, 10, 50, 100, 200, 500]}

#Training the model on train data
GBDT_TFIDF = GridSearchCV(classifier, parameters, cv=3, return_train_score=True, scoring='roc_auc', n_jobs=-1)
GBDT_TFIDF.fit(x_train_tfidf, y_train_tfidf)
Out[0]:
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=GradientBoostingClassifier(criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
                                                  presort='auto',
                                                  random_state=None,
                                                  subsample=1.0, tol=0.0001,
                                                  validation_fraction=0.1,
                                                  verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'n_estimators': [5, 10, 50, 100, 200, 500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=0)
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://stackoverflow.com/questions/20944483/python-3-sort-a-dict-by-its-values/20948781

print(GBDT_TFIDF.best_params_) #Gives the best value of parameters from the given range

print(GBDT_TFIDF.cv_results_['mean_train_score'])
print(GBDT_TFIDF.cv_results_['mean_test_score'])
print(parameters['n_estimators'])

plt.figure(figsize=(10,3))
plt.plot(parameters['n_estimators'],GBDT_TFIDF.cv_results_['mean_train_score'], label="Train")
plt.plot(parameters['n_estimators'],GBDT_TFIDF.cv_results_['mean_test_score'], label="Test")
plt.title('AUC plot for train and test datasets')
plt.xlabel('n_estimator values')
plt.ylabel('Area under ROC Curve')
plt.legend()
plt.grid()
plt.show()
plt.close()

plt.close()
{'n_estimators': 500}
[0.68970192 0.70523333 0.75880288 0.78940395 0.82691776 0.88789916]
[0.6737828  0.68726636 0.72606083 0.73706577 0.74363092 0.74647618]
[5, 10, 50, 100, 200, 500]
In [0]:
#https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
#https://stackoverflow.com/questions/34894587/should-we-plot-the-roc-curve-for-each-class

from sklearn.metrics import roc_curve, auc

#training the model on the best K value found in the above result 
final_GBDT_tfidf = GradientBoostingClassifier(n_estimators=500)
final_GBDT_tfidf.fit(x_train_tfidf,y_train_tfidf)

x_train_tfidf_csr=x_train_tfidf.tocsr()
x_test_tfidf_csr=x_test_tfidf.tocsr()

y_train_tfidf_pred=[]
y_test_tfidf_pred=[]

#ROC curve function takes the actual values and the predicted probabilities of the positive class
for i in range(0,x_train_tfidf.shape[0]):
    y_train_tfidf_pred.extend(final_GBDT_tfidf.predict_proba(x_train_tfidf_csr[i])[:,1]) #[:,1] gives the probability for class 1

for i in range(0,x_test_tfidf.shape[0]):
    y_test_tfidf_pred.extend(final_GBDT_tfidf.predict_proba(x_test_tfidf_csr[i])[:,1])
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
#https://www.programcreek.com/python/example/81207/sklearn.metrics.roc_curve
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

#Calculating FPR and TPR for train and test data
train_tfidf_fpr, train_tfidf_tpr, train_tfidf_thresholds = roc_curve(y_train_tfidf, y_train_tfidf_pred)
test_tfidf_fpr, test_tfidf_tpr, test_tfidf_thresholds = roc_curve(y_test_tfidf, y_test_tfidf_pred)

#Calculating AUC for train and test curves
roc_auc_tfidf_train=auc(train_tfidf_fpr,train_tfidf_tpr)
roc_auc_tfidf_test=auc(test_tfidf_fpr,test_tfidf_tpr)

plt.plot(train_tfidf_fpr, train_tfidf_tpr, label="Train ROC Curve (area=%0.3f)" % roc_auc_tfidf_train)
plt.plot(test_tfidf_fpr, test_tfidf_tpr, label="Test ROC Curve (area=%0.3f)" % roc_auc_tfidf_test)
plt.plot([0,1],[0,1],linestyle='--')
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve for TFIDF")
plt.grid()
plt.show()
plt.close()
In [0]:
print(np.median(train_tfidf_thresholds))
0.625259944085036
In [0]:
#https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
#https://datatofish.com/confusion-matrix-python/

from sklearn.metrics import confusion_matrix as cf_mx

predicted_train_tfidf=[]

expected_train_tfidf = y_train_tfidf.values
for i in range(0,x_train.shape[0]):
    predicted_train_tfidf.extend((final_GBDT_tfidf.predict_proba(x_train_tfidf_csr[i])[:,1]>= 0.6252).astype(bool))

predicted_test_tfidf=[]

expected_test_tfidf = y_test_tfidf.values
for i in range(0,x_test.shape[0]):
    predicted_test_tfidf.extend((final_GBDT_tfidf.predict_proba(x_test_tfidf_csr[i])[:,1]>= 0.6252).astype(bool))
In [0]:
plt.subplots(figsize=(15,4))
plt.subplot(1,2,1)
cmdf_train=cf_mx(expected_train_tfidf, predicted_train_tfidf)
df_cm_train = pd.DataFrame(cmdf_train, range(2),range(2))
df_cm_train.columns = ['Predicted: NO','Predicted: YES']
df_cm_train = df_cm_train.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for train data using Avg W2V')

plt.subplot(1,2,2)
cmdf_test=cf_mx(expected_test_tfidf, predicted_test_tfidf)
df_cm_test = pd.DataFrame(cmdf_test, range(2),range(2))
df_cm_test.columns = ['Predicted: NO','Predicted: YES']
df_cm_test = df_cm_test.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for test data using Avg W2V')
plt.subplots_adjust(wspace=0.5)
plt.show()
plt.close()

2.4.5 Applying RF Classifier brute force on AVG W2V, SET 3

Hyper paramter tuning method: GridSearch

In [0]:
#https://www.digitalocean.com/community/tutorials/how-to-plot-data-in-python-3-using-matplotlib
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
#https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    
from scipy.sparse import hstack
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
import matplotlib.patches as mpatches
from sklearn.metrics import roc_auc_score

x_train_avg_w2v = hstack((df_train['cat_1'].values.reshape(-1,1), df_train['cat_0'].values.reshape(-1,1), df_train['subcat_1'].values.reshape(-1,1),
                  df_train['subcat_0'].values.reshape(-1,1), df_train['state_1'].values.reshape(-1,1), df_train['state_0'].values.reshape(-1,1),
                  df_train['teacherprefix_1'].values.reshape(-1,1), df_train['teacherprefix_0'].values.reshape(-1,1),
                  df_train['project_grade_category_1'].values.reshape(-1,1), df_train['project_grade_category_0'].values.reshape(-1,1), 
                  price_train_standardized, prev_proj_train_standardized, wc_title_train_standardized, wc_essay_train_standardized, senti_score_train_standardized,
                  qty_train_standardized, title_train_bow, avg_w2v_train_text_vectors, avg_w2v_title_train_vectors))
y_train_avg_w2v = df_train['project_is_approved']

x_test_avg_w2v = hstack((df_test['cat_1'].values.reshape(-1,1), df_test['cat_0'].values.reshape(-1,1), df_test['subcat_1'].values.reshape(-1,1),
                  df_test['subcat_0'].values.reshape(-1,1), df_test['state_1'].values.reshape(-1,1), df_test['state_0'].values.reshape(-1,1),
                  df_test['teacherprefix_1'].values.reshape(-1,1), df_test['teacherprefix_0'].values.reshape(-1,1), 
                         df_test['project_grade_category_1'].values.reshape(-1,1), df_test['project_grade_category_0'].values.reshape(-1,1), price_test_standardized,
                  prev_proj_test_standardized, wc_title_test_standardized, wc_essay_test_standardized, senti_score_test_standardized,
                        qty_test_standardized, title_test_bow, avg_w2v_test_text_vectors, avg_w2v_title_test_vectors))
y_test_avg_w2v = df_test['project_is_approved']

print(x_train_avg_w2v.shape, type(x_train_avg_w2v), y_train_avg_w2v.shape, type(y_train_avg_w2v))
print(x_test_avg_w2v.shape, type(x_test_avg_w2v), y_test_avg_w2v.shape, type(y_test_avg_w2v))
(35000, 1675) <class 'scipy.sparse.coo.coo_matrix'> (35000,) <class 'pandas.core.series.Series'>
(15000, 1675) <class 'scipy.sparse.coo.coo_matrix'> (15000,) <class 'pandas.core.series.Series'>
In [0]:
#https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

#Initialising Classifier
classifier = RandomForestClassifier(class_weight='balanced')

#Brute force approach for finding best K value
parameters = {'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
              'n_estimators': [5, 10, 50, 100, 200, 500, 1000]}

#Training the model on train data
RF_avg_w2v = GridSearchCV(classifier, parameters, return_train_score=True, cv=3, scoring='roc_auc', n_jobs=-1)
RF_avg_w2v.fit(x_train_avg_w2v, y_train_avg_w2v)
Out[0]:
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True,
                                              class_weight='balanced',
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'n_estimators': [5, 10, 50, 100, 200, 500, 1000]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=0)
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://stackoverflow.com/questions/20944483/python-3-sort-a-dict-by-its-values/20948781

print(RF_avg_w2v.best_params_) #Gives the best value of parameters from the given range

train_scores = RF_avg_w2v.cv_results_['mean_train_score'].reshape(len(parameters['max_depth']),len(parameters['n_estimators']))
test_scores = RF_avg_w2v.cv_results_['mean_test_score'].reshape(len(parameters['max_depth']),len(parameters['n_estimators']))

df_tr=pd.DataFrame(train_scores)
df_tr.index=parameters['max_depth']
df_tr.columns=parameters['n_estimators']

df_te=pd.DataFrame(test_scores)
df_te.index=parameters['max_depth']
df_te.columns=parameters['n_estimators']

plt.subplots(figsize=(20,4))
plt.subplot(1,2,1)
sns.heatmap(df_tr, annot=True,annot_kws={"size": 10}, fmt='g')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('AUC plot for Train data')
plt.subplots_adjust(wspace=0.5)

plt.subplot(1,2,2)
sns.heatmap(df_te, annot=True,annot_kws={"size": 10}, fmt='g')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('AUC plot for Test data')
plt.subplots_adjust(wspace=0.5)
plt.show()

plt.close()
{'max_depth': 10, 'n_estimators': 500}
In [0]:
#https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
#https://stackoverflow.com/questions/34894587/should-we-plot-the-roc-curve-for-each-class

from sklearn.metrics import roc_curve, auc

#training the model on the best K value found in the above result 
final_RF_avg_w2v = RandomForestClassifier(max_depth=10, n_estimators=500, class_weight='balanced')
final_RF_avg_w2v.fit(x_train_avg_w2v, y_train_avg_w2v)

x_train_avg_w2v_csr=x_train_avg_w2v.tocsr()
x_test_avg_w2v_csr=x_test_avg_w2v.tocsr()

y_train_avg_w2v_pred=[]
y_test_avg_w2v_pred=[]

#ROC curve function takes the actual values and the predicted probabilities of the positive class
for i in range(0,x_train_avg_w2v.shape[0]):
    y_train_avg_w2v_pred.extend(final_RF_avg_w2v.predict_proba(x_train_avg_w2v_csr[i])[:,1]) #[:,1] gives the probability for class 1

for i in range(0,x_test_avg_w2v.shape[0]):
    y_test_avg_w2v_pred.extend(final_RF_avg_w2v.predict_proba(x_test_avg_w2v_csr[i])[:,1])
    
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
#https://www.programcreek.com/python/example/81207/sklearn.metrics.roc_curve
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

#Calculating FPR and TPR for train and test data
train_avg_w2v_fpr, train_avg_w2v_tpr, train_avg_w2v_thresholds = roc_curve(y_train_avg_w2v, y_train_avg_w2v_pred)
test_avg_w2v_fpr, test_avg_w2v_tpr, test_avg_w2v_thresholds = roc_curve(y_test_avg_w2v, y_test_avg_w2v_pred)

#Calculating AUC for train and test curves
roc_auc_avg_w2v_train=auc(train_avg_w2v_fpr,train_avg_w2v_tpr)
roc_auc_avg_w2v_test=auc(test_avg_w2v_fpr,test_avg_w2v_tpr)

plt.plot(train_avg_w2v_fpr, train_avg_w2v_tpr, label="Train ROC Curve (area=%0.3f)" % roc_auc_avg_w2v_train)
plt.plot(test_avg_w2v_fpr, test_avg_w2v_tpr, label="Test ROC Curve (area=%0.3f)" % roc_auc_avg_w2v_test)
plt.plot([0,1],[0,1],linestyle='--')
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve for AVG W2V")
plt.grid()
plt.show()
plt.close()
In [0]:
print(np.median(train_avg_w2v_thresholds))
0.5009357733666363
In [0]:
#https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
#https://datatofish.com/confusion-matrix-python/

from sklearn.metrics import confusion_matrix as cf_mx

predicted_avg_train_w2v=[]

expected_avg_train_w2v = y_train_avg_w2v.values
for i in range(0,x_train_avg_w2v.shape[0]):
  predicted_avg_train_w2v.extend((final_RF_avg_w2v.predict_proba(x_train_avg_w2v_csr[i])[:,1]>= 0.501).astype(bool))
In [0]:
predicted_avg_test_w2v =[]
expected_avg_test_w2v = y_test_avg_w2v.values
for i in range(0,x_test_avg_w2v.shape[0]):
    predicted_avg_test_w2v.extend((final_RF_avg_w2v.predict_proba(x_test_avg_w2v_csr[i])[:,1]>= 0.501).astype(bool))

plt.subplots(figsize=(15,4))
plt.subplot(1,2,1)
cmdf_train=cf_mx(expected_avg_train_w2v, predicted_avg_train_w2v)
df_cm_train = pd.DataFrame(cmdf_train, range(2),range(2))
df_cm_train.columns = ['Predicted: NO','Predicted: YES']
df_cm_train = df_cm_train.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for train data using Avg W2V')

plt.subplot(1,2,2)
cmdf_test=cf_mx(expected_avg_test_w2v, predicted_avg_test_w2v)
df_cm_test = pd.DataFrame(cmdf_test, range(2),range(2))
df_cm_test.columns = ['Predicted: NO','Predicted: YES']
df_cm_test = df_cm_test.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for test data using Avg W2V')
plt.subplots_adjust(wspace=0.5)
plt.show()
plt.close()

2.4.6 Applying GBDT Classifier brute force on AVG W2V, SET 3

Hyper paramter tuning method: GridSearch

In [2]:
import dill
#dill.dump_session('drive/My Drive/Colab Notebooks/sess_GBDT.pckl')
dill.load_session('drive/My Drive/Colab Notebooks/sess_GBDT.pckl')
/usr/local/lib/python3.6/dist-packages/nltk/twitter/__init__.py:20: UserWarning:

The twython library has not been installed. Some functionality from the twitter package will not be available.

In [22]:
#https://www.digitalocean.com/community/tutorials/how-to-plot-data-in-python-3-using-matplotlib
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
#https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    
from scipy.sparse import hstack
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
import matplotlib.patches as mpatches
from sklearn.metrics import roc_auc_score

x_train_avg_w2v = hstack((df_train['cat_1'].values.reshape(-1,1), df_train['cat_0'].values.reshape(-1,1), df_train['subcat_1'].values.reshape(-1,1),
                  df_train['subcat_0'].values.reshape(-1,1), df_train['state_1'].values.reshape(-1,1), df_train['state_0'].values.reshape(-1,1),
                  df_train['teacherprefix_1'].values.reshape(-1,1), df_train['teacherprefix_0'].values.reshape(-1,1),
                  df_train['project_grade_category_1'].values.reshape(-1,1), df_train['project_grade_category_0'].values.reshape(-1,1), 
                  price_train_standardized, prev_proj_train_standardized, wc_title_train_standardized, wc_essay_train_standardized, senti_score_train_standardized,
                  qty_train_standardized, title_train_bow, avg_w2v_train_text_vectors, avg_w2v_title_train_vectors))
y_train_avg_w2v = df_train['project_is_approved']

x_test_avg_w2v = hstack((df_test['cat_1'].values.reshape(-1,1), df_test['cat_0'].values.reshape(-1,1), df_test['subcat_1'].values.reshape(-1,1),
                  df_test['subcat_0'].values.reshape(-1,1), df_test['state_1'].values.reshape(-1,1), df_test['state_0'].values.reshape(-1,1),
                  df_test['teacherprefix_1'].values.reshape(-1,1), df_test['teacherprefix_0'].values.reshape(-1,1), 
                         df_test['project_grade_category_1'].values.reshape(-1,1), df_test['project_grade_category_0'].values.reshape(-1,1), price_test_standardized,
                  prev_proj_test_standardized, wc_title_test_standardized, wc_essay_test_standardized, senti_score_test_standardized,
                        qty_test_standardized, title_test_bow, avg_w2v_test_text_vectors, avg_w2v_title_test_vectors))
y_test_avg_w2v = df_test['project_is_approved']

print(x_train_avg_w2v.shape, type(x_train_avg_w2v), y_train_avg_w2v.shape, type(y_train_avg_w2v))
print(x_test_avg_w2v.shape, type(x_test_avg_w2v), y_test_avg_w2v.shape, type(y_test_avg_w2v))
(35000, 1675) <class 'scipy.sparse.coo.coo_matrix'> (35000,) <class 'pandas.core.series.Series'>
(15000, 1675) <class 'scipy.sparse.coo.coo_matrix'> (15000,) <class 'pandas.core.series.Series'>
In [23]:
#https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

#Initialising Classifier
classifier = GradientBoostingClassifier()

#Brute force approach for finding best K value
parameters = {'n_estimators': [5, 10, 50, 100, 200, 500]}

#Training the model on train data
GBDT_avg_w2v = GridSearchCV(classifier, parameters, return_train_score=True, cv=3, scoring='roc_auc', n_jobs=-1)
GBDT_avg_w2v.fit(x_train_avg_w2v, y_train_avg_w2v)
Out[23]:
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=GradientBoostingClassifier(criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
                                                  presort='auto',
                                                  random_state=None,
                                                  subsample=1.0, tol=0.0001,
                                                  validation_fraction=0.1,
                                                  verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'n_estimators': [5, 10, 50, 100, 200, 500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=0)
In [24]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://stackoverflow.com/questions/20944483/python-3-sort-a-dict-by-its-values/20948781

print(GBDT_avg_w2v.best_params_) #Gives the best value of parameters from the given range

print(GBDT_avg_w2v.cv_results_['mean_train_score'])
print(GBDT_avg_w2v.cv_results_['mean_test_score'])
print(parameters['n_estimators'])

plt.figure(figsize=(10,3))
plt.plot(parameters['n_estimators'],GBDT_avg_w2v.cv_results_['mean_train_score'], label="Train")
plt.plot(parameters['n_estimators'],GBDT_avg_w2v.cv_results_['mean_test_score'], label="Test")
plt.title('AUC plot for train and test datasets')
plt.xlabel('n_estimator values')
plt.ylabel('Area under ROC Curve')
plt.legend()
plt.grid()
plt.show()
plt.close()

plt.close()
{'n_estimators': 200}
[0.69619934 0.71381102 0.75679039 0.77965458 0.80796692 0.85836408]
[0.68278247 0.6977755  0.72593091 0.73335136 0.73608661 0.73601381]
[5, 10, 50, 100, 200, 500]
In [0]:
#https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
#https://stackoverflow.com/questions/34894587/should-we-plot-the-roc-curve-for-each-class

from sklearn.metrics import roc_curve, auc

#training the model on the best K value found in the above result 
final_GBDT_avg_w2v = GradientBoostingClassifier(n_estimators=200)
final_GBDT_avg_w2v.fit(x_train_avg_w2v, y_train_avg_w2v)

x_train_avg_w2v_csr=x_train_avg_w2v.tocsr()
x_test_avg_w2v_csr=x_test_avg_w2v.tocsr()

y_train_avg_w2v_pred=[]
y_test_avg_w2v_pred=[]

#ROC curve function takes the actual values and the predicted probabilities of the positive class
for i in range(0,x_train_avg_w2v.shape[0]):
    y_train_avg_w2v_pred.extend(final_GBDT_avg_w2v.predict_proba(x_train_avg_w2v_csr[i])[:,1]) #[:,1] gives the probability for class 1

for i in range(0,x_test_avg_w2v.shape[0]):
    y_test_avg_w2v_pred.extend(final_GBDT_avg_w2v.predict_proba(x_test_avg_w2v_csr[i])[:,1])
    
In [26]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
#https://www.programcreek.com/python/example/81207/sklearn.metrics.roc_curve
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

#Calculating FPR and TPR for train and test data
train_avg_w2v_fpr, train_avg_w2v_tpr, train_avg_w2v_thresholds = roc_curve(y_train_avg_w2v, y_train_avg_w2v_pred)
test_avg_w2v_fpr, test_avg_w2v_tpr, test_avg_w2v_thresholds = roc_curve(y_test_avg_w2v, y_test_avg_w2v_pred)

#Calculating AUC for train and test curves
roc_auc_avg_w2v_train=auc(train_avg_w2v_fpr,train_avg_w2v_tpr)
roc_auc_avg_w2v_test=auc(test_avg_w2v_fpr,test_avg_w2v_tpr)

plt.plot(train_avg_w2v_fpr, train_avg_w2v_tpr, label="Train ROC Curve (area=%0.3f)" % roc_auc_avg_w2v_train)
plt.plot(test_avg_w2v_fpr, test_avg_w2v_tpr, label="Test ROC Curve (area=%0.3f)" % roc_auc_avg_w2v_test)
plt.plot([0,1],[0,1],linestyle='--')
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve for AVG W2V")
plt.grid()
plt.show()
plt.close()
In [27]:
print(np.median(train_avg_w2v_thresholds))
0.6485078678409508
In [28]:
#https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
#https://datatofish.com/confusion-matrix-python/

from sklearn.metrics import confusion_matrix as cf_mx

predicted_avg_train_w2v=[]

expected_avg_train_w2v = y_train_avg_w2v.values
for i in range(0,x_train_avg_w2v.shape[0]):
    predicted_avg_train_w2v.extend((final_GBDT_avg_w2v.predict_proba(x_train_avg_w2v_csr[i])[:,1]>= 0.6485).astype(bool))

predicted_avg_test_w2v=[]

expected_avg_test_w2v = y_test_avg_w2v.values
for i in range(0,x_test_avg_w2v.shape[0]):
    predicted_avg_test_w2v.extend((final_GBDT_avg_w2v.predict_proba(x_test_avg_w2v_csr[i])[:,1]>= 0.6485).astype(bool))

plt.subplots(figsize=(15,4))
plt.subplot(1,2,1)
cmdf_train=cf_mx(expected_avg_train_w2v, predicted_avg_train_w2v)
df_cm_train = pd.DataFrame(cmdf_train, range(2),range(2))
df_cm_train.columns = ['Predicted: NO','Predicted: YES']
df_cm_train = df_cm_train.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for train data using Avg W2V')

plt.subplot(1,2,2)
cmdf_test=cf_mx(expected_avg_test_w2v, predicted_avg_test_w2v)
df_cm_test = pd.DataFrame(cmdf_test, range(2),range(2))
df_cm_test.columns = ['Predicted: NO','Predicted: YES']
df_cm_test = df_cm_test.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for test data using Avg W2V')
plt.subplots_adjust(wspace=0.5)
plt.show()
plt.close()

2.4.7 Applying RF Classifier brute force on TFIDF W2V, SET 4

Hyper paramter tuning method: GridSearch

In [0]:
#https://www.digitalocean.com/community/tutorials/how-to-plot-data-in-python-3-using-matplotlib
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
#https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    
from scipy.sparse import hstack
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
import matplotlib.patches as mpatches
from sklearn.metrics import roc_auc_score

x_train_tfidf_w2v = hstack((df_train['cat_1'].values.reshape(-1,1), df_train['cat_0'].values.reshape(-1,1), df_train['subcat_1'].values.reshape(-1,1),
                  df_train['subcat_0'].values.reshape(-1,1), df_train['state_1'].values.reshape(-1,1), df_train['state_0'].values.reshape(-1,1),
                  df_train['teacherprefix_1'].values.reshape(-1,1), df_train['teacherprefix_0'].values.reshape(-1,1),
                  df_train['project_grade_category_1'].values.reshape(-1,1), df_train['project_grade_category_0'].values.reshape(-1,1), price_train_standardized,
                  prev_proj_train_standardized, wc_title_train_standardized, wc_essay_train_standardized, senti_score_train_standardized,
                        qty_train_standardized, title_train_bow, tfidf_w2v_train_text_vectors, tfidf_w2v_train_title_vectors))
y_train_tfidf_w2v = df_train['project_is_approved']

x_test_tfidf_w2v = hstack((df_test['cat_1'].values.reshape(-1,1), df_test['cat_0'].values.reshape(-1,1), df_test['subcat_1'].values.reshape(-1,1),
                  df_test['subcat_0'].values.reshape(-1,1), df_test['state_1'].values.reshape(-1,1), df_test['state_0'].values.reshape(-1,1),
                  df_test['teacherprefix_1'].values.reshape(-1,1), df_test['teacherprefix_0'].values.reshape(-1,1), 
                           df_test['project_grade_category_1'].values.reshape(-1,1), df_test['project_grade_category_0'].values.reshape(-1,1), price_test_standardized,
                  prev_proj_test_standardized, wc_title_test_standardized, wc_essay_test_standardized, senti_score_test_standardized,
                        qty_test_standardized, title_test_bow, tfidf_w2v_test_text_vectors, tfidf_w2v_test_title_vectors))
y_test_tfidf_w2v = df_test['project_is_approved']

print(x_train_tfidf_w2v.shape, type(x_train_tfidf_w2v), y_train_tfidf_w2v.shape, type(y_train_tfidf_w2v))
print(x_test_tfidf_w2v.shape, type(x_test_tfidf_w2v), y_test_tfidf_w2v.shape, type(y_test_tfidf_w2v))
(35000, 1675) <class 'scipy.sparse.coo.coo_matrix'> (35000,) <class 'pandas.core.series.Series'>
(15000, 1675) <class 'scipy.sparse.coo.coo_matrix'> (15000,) <class 'pandas.core.series.Series'>
In [0]:
#https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

#Initialising Classifier
classifier = RandomForestClassifier(class_weight='balanced')

#Brute force approach for finding best K value
parameters = {'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
              'n_estimators': [5, 10, 50, 100, 200, 500, 1000]}

#Training the model on train data
RF_tfidf_w2v = GridSearchCV(classifier, parameters, return_train_score=True, cv=3, scoring='roc_auc', n_jobs=-1)
RF_tfidf_w2v.fit(x_train_tfidf_w2v, y_train_tfidf_w2v)
Out[0]:
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True,
                                              class_weight='balanced',
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'n_estimators': [5, 10, 50, 100, 200, 500, 1000]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=0)
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://stackoverflow.com/questions/20944483/python-3-sort-a-dict-by-its-values/20948781

print(RF_tfidf_w2v.best_params_) #Gives the best value of parameters from the given range

train_scores = RF_tfidf_w2v.cv_results_['mean_train_score'].reshape(len(parameters['max_depth']),len(parameters['n_estimators']))
test_scores = RF_tfidf_w2v.cv_results_['mean_test_score'].reshape(len(parameters['max_depth']),len(parameters['n_estimators']))

df_tr=pd.DataFrame(train_scores)
df_tr.index=parameters['max_depth']
df_tr.columns=parameters['n_estimators']

df_te=pd.DataFrame(test_scores)
df_te.index=parameters['max_depth']
df_te.columns=parameters['n_estimators']

plt.subplots(figsize=(20,4))
plt.subplot(1,2,1)
sns.heatmap(df_tr, annot=True,annot_kws={"size": 10}, fmt='g')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('AUC plot for Train data')
plt.subplots_adjust(wspace=0.5)

plt.subplot(1,2,2)
sns.heatmap(df_te, annot=True,annot_kws={"size": 10}, fmt='g')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('AUC plot for Test data')
plt.subplots_adjust(wspace=0.5)
plt.show()

plt.close()
{'max_depth': 10, 'n_estimators': 1000}
In [0]:
#https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
#https://stackoverflow.com/questions/34894587/should-we-plot-the-roc-curve-for-each-class

from sklearn.metrics import roc_curve, auc

#training the model on the best K value found in the above result 
final_RF_tfidf_w2v = RandomForestClassifier(max_depth=10, n_estimators=1000, class_weight='balanced')
final_RF_tfidf_w2v.fit(x_train_tfidf_w2v, y_train_tfidf_w2v)

x_train_tfidf_w2v_csr=x_train_tfidf_w2v.tocsr()
x_test_tfidf_w2v_csr=x_test_tfidf_w2v.tocsr()

y_train_tfidf_w2v_pred=[]
y_test_tfidf_w2v_pred=[]

#ROC curve function takes the actual values and the predicted probabilities of the positive class
for i in range(0,x_train_tfidf_w2v.shape[0]):
    y_train_tfidf_w2v_pred.extend(final_RF_tfidf_w2v.predict_proba(x_train_tfidf_w2v_csr[i])[:,1]) #[:,1] gives the probability for class 1

for i in range(0,x_test_tfidf_w2v.shape[0]):
    y_test_tfidf_w2v_pred.extend(final_RF_tfidf_w2v.predict_proba(x_test_tfidf_w2v_csr[i])[:,1])
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
#https://www.programcreek.com/python/example/81207/sklearn.metrics.roc_curve
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

#Calculating FPR and TPR for train and test data
train_tfidf_w2v_fpr, train_tfidf_w2v_tpr, train_tfidf_w2v_thresholds = roc_curve(y_train_tfidf_w2v, y_train_tfidf_w2v_pred)
test_tfidf_w2v_fpr, test_tfidf_w2v_tpr, test_tfidf_w2v_thresholds = roc_curve(y_test_tfidf_w2v, y_test_tfidf_w2v_pred)

#Calculating AUC for train and test curves
roc_auc_tfidf_w2v_train=auc(train_tfidf_w2v_fpr,train_tfidf_w2v_tpr)
roc_auc_tfidf_w2v_test=auc(test_tfidf_w2v_fpr,test_tfidf_w2v_tpr)

plt.plot(train_tfidf_w2v_fpr, train_tfidf_w2v_tpr, label="Train ROC Curve (area=%0.3f)" % roc_auc_tfidf_w2v_train)
plt.plot(test_tfidf_w2v_fpr, test_tfidf_w2v_tpr, label="Test ROC Curve (area=%0.3f)" % roc_auc_tfidf_w2v_test)
plt.plot([0,1],[0,1],linestyle='--')
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve for TFIDF AVGW2V")
plt.grid()
plt.show()
plt.close()
In [0]:
print(np.median(train_tfidf_w2v_thresholds))
0.4941339115572567
In [0]:
#https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
#https://datatofish.com/confusion-matrix-python/

from sklearn.metrics import confusion_matrix as cf_mx

predicted_tfidf_train_w2v=[]

expected_tfidf_train_w2v = y_train_tfidf_w2v.values

for i in range(0,x_train_tfidf_w2v.shape[0]):
    predicted_tfidf_train_w2v.extend((final_RF_tfidf_w2v.predict_proba(x_train_tfidf_w2v_csr[i])[:,1]>=0.4941).astype(bool))

predicted_tfidf_test_w2v=[]

expected_tfidf_test_w2v = y_test_tfidf_w2v.values

for i in range(0,x_test_tfidf_w2v.shape[0]):
    predicted_tfidf_test_w2v.extend((final_RF_tfidf_w2v.predict_proba(x_test_tfidf_w2v_csr[i])[:,1]>=0.4941).astype(bool))
In [0]:
plt.subplots(figsize=(15,4))
plt.subplot(1,2,1)
cmdf_train=cf_mx(expected_tfidf_train_w2v, predicted_tfidf_train_w2v)
df_cm_train = pd.DataFrame(cmdf_train, range(2),range(2))
df_cm_train.columns = ['Predicted: NO','Predicted: YES']
df_cm_train = df_cm_train.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for train data using TFIDF W2V')

plt.subplot(1,2,2)
cmdf_test=cf_mx(expected_tfidf_test_w2v, predicted_tfidf_test_w2v)
df_cm_test = pd.DataFrame(cmdf_test, range(2),range(2))
df_cm_test.columns = ['Predicted: NO','Predicted: YES']
df_cm_test = df_cm_test.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for test data using TFIDF W2V')
plt.subplots_adjust(wspace=0.5)
plt.show()
plt.close()

2.4.8 Applying GBDT Classifier brute force on TFIDF W2V, SET 4

Hyper paramter tuning method: GridSearch

In [0]:
#https://www.digitalocean.com/community/tutorials/how-to-plot-data-in-python-3-using-matplotlib
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
#https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    
from scipy.sparse import hstack
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
import matplotlib.patches as mpatches
from sklearn.metrics import roc_auc_score

x_train_tfidf_w2v = hstack((df_train['cat_1'].values.reshape(-1,1), df_train['cat_0'].values.reshape(-1,1), df_train['subcat_1'].values.reshape(-1,1),
                  df_train['subcat_0'].values.reshape(-1,1), df_train['state_1'].values.reshape(-1,1), df_train['state_0'].values.reshape(-1,1),
                  df_train['teacherprefix_1'].values.reshape(-1,1), df_train['teacherprefix_0'].values.reshape(-1,1),
                  df_train['project_grade_category_1'].values.reshape(-1,1), df_train['project_grade_category_0'].values.reshape(-1,1), price_train_standardized,
                  prev_proj_train_standardized, wc_title_train_standardized, wc_essay_train_standardized, senti_score_train_standardized,
                        qty_train_standardized, title_train_bow, tfidf_w2v_train_text_vectors, tfidf_w2v_train_title_vectors))
y_train_tfidf_w2v = df_train['project_is_approved']

x_test_tfidf_w2v = hstack((df_test['cat_1'].values.reshape(-1,1), df_test['cat_0'].values.reshape(-1,1), df_test['subcat_1'].values.reshape(-1,1),
                  df_test['subcat_0'].values.reshape(-1,1), df_test['state_1'].values.reshape(-1,1), df_test['state_0'].values.reshape(-1,1),
                  df_test['teacherprefix_1'].values.reshape(-1,1), df_test['teacherprefix_0'].values.reshape(-1,1), 
                           df_test['project_grade_category_1'].values.reshape(-1,1), df_test['project_grade_category_0'].values.reshape(-1,1), price_test_standardized,
                  prev_proj_test_standardized, wc_title_test_standardized, wc_essay_test_standardized, senti_score_test_standardized,
                        qty_test_standardized, title_test_bow, tfidf_w2v_test_text_vectors, tfidf_w2v_test_title_vectors))
y_test_tfidf_w2v = df_test['project_is_approved']

print(x_train_tfidf_w2v.shape, type(x_train_tfidf_w2v), y_train_tfidf_w2v.shape, type(y_train_tfidf_w2v))
print(x_test_tfidf_w2v.shape, type(x_test_tfidf_w2v), y_test_tfidf_w2v.shape, type(y_test_tfidf_w2v))
(35000, 1675) <class 'scipy.sparse.coo.coo_matrix'> (35000,) <class 'pandas.core.series.Series'>
(15000, 1675) <class 'scipy.sparse.coo.coo_matrix'> (15000,) <class 'pandas.core.series.Series'>
In [0]:
#https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

#Initialising Classifier
classifier = GradientBoostingClassifier()

#Brute force approach for finding best K value
parameters = {'n_estimators': [5, 10, 50, 100, 200, 500]}

#Training the model on train data
GBDT_tfidf_w2v = GridSearchCV(classifier, parameters, return_train_score=True, cv=3, scoring='roc_auc', n_jobs=-1)
GBDT_tfidf_w2v.fit(x_train_tfidf_w2v, y_train_tfidf_w2v)
Out[0]:
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=GradientBoostingClassifier(criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
                                                  presort='auto',
                                                  random_state=None,
                                                  subsample=1.0, tol=0.0001,
                                                  validation_fraction=0.1,
                                                  verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'n_estimators': [5, 10, 50, 100, 200, 500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=0)
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://stackoverflow.com/questions/20944483/python-3-sort-a-dict-by-its-values/20948781

print(GBDT_tfidf_w2v.best_params_) #Gives the best value of parameters from the given range

print(GBDT_tfidf_w2v.cv_results_['mean_train_score'])
print(GBDT_tfidf_w2v.cv_results_['mean_test_score'])
print(parameters['n_estimators'])

plt.figure(figsize=(10,3))
plt.plot(parameters['n_estimators'],GBDT_tfidf_w2v.cv_results_['mean_train_score'], label="Train")
plt.plot(parameters['n_estimators'],GBDT_tfidf_w2v.cv_results_['mean_test_score'], label="Test")
plt.title('AUC plot for train and test datasets')
plt.xlabel('n_estimator values')
plt.ylabel('Area under ROC Curve')
plt.legend()
plt.grid()
plt.show()
plt.close()

plt.close()
{'n_estimators': 200}
[0.69624193 0.71510775 0.75665657 0.779309   0.80716989 0.85867612]
[0.67747586 0.69602169 0.72572338 0.73240088 0.73454294 0.7333579 ]
[5, 10, 50, 100, 200, 500]
In [0]:
#https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier
#https://stackoverflow.com/questions/34894587/should-we-plot-the-roc-curve-for-each-class

from sklearn.metrics import roc_curve, auc

#training the model on the best K value found in the above result 
final_GBDT_tfidf_w2v = GradientBoostingClassifier(n_estimators=200)
final_GBDT_tfidf_w2v.fit(x_train_tfidf_w2v, y_train_tfidf_w2v)

x_train_tfidf_w2v_csr=x_train_tfidf_w2v.tocsr()
x_test_tfidf_w2v_csr=x_test_tfidf_w2v.tocsr()

y_train_tfidf_w2v_pred=[]
y_test_tfidf_w2v_pred=[]

#ROC curve function takes the actual values and the predicted probabilities of the positive class
for i in range(0,x_train_tfidf_w2v.shape[0]):
    y_train_tfidf_w2v_pred.extend(final_GBDT_tfidf_w2v.predict_proba(x_train_tfidf_w2v_csr[i])[:,1]) #[:,1] gives the probability for class 1

for i in range(0,x_test_tfidf_w2v.shape[0]):
    y_test_tfidf_w2v_pred.extend(final_GBDT_tfidf_w2v.predict_proba(x_test_tfidf_w2v_csr[i])[:,1])
In [0]:
#https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
#https://www.programcreek.com/python/example/81207/sklearn.metrics.roc_curve
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

#Calculating FPR and TPR for train and test data
train_tfidf_w2v_fpr, train_tfidf_w2v_tpr, train_tfidf_w2v_thresholds = roc_curve(y_train_tfidf_w2v, y_train_tfidf_w2v_pred)
test_tfidf_w2v_fpr, test_tfidf_w2v_tpr, test_tfidf_w2v_thresholds = roc_curve(y_test_tfidf_w2v, y_test_tfidf_w2v_pred)

#Calculating AUC for train and test curves
roc_auc_tfidf_w2v_train=auc(train_tfidf_w2v_fpr,train_tfidf_w2v_tpr)
roc_auc_tfidf_w2v_test=auc(test_tfidf_w2v_fpr,test_tfidf_w2v_tpr)

plt.plot(train_tfidf_w2v_fpr, train_tfidf_w2v_tpr, label="Train ROC Curve (area=%0.3f)" % roc_auc_tfidf_w2v_train)
plt.plot(test_tfidf_w2v_fpr, test_tfidf_w2v_tpr, label="Test ROC Curve (area=%0.3f)" % roc_auc_tfidf_w2v_test)
plt.plot([0,1],[0,1],linestyle='--')
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve for TFIDF AVGW2V")
plt.grid()
plt.show()
plt.close()
In [0]:
print(np.median(train_tfidf_w2v_thresholds))
0.6419639425499812
In [0]:
#https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
#https://datatofish.com/confusion-matrix-python/

from sklearn.metrics import confusion_matrix as cf_mx

predicted_tfidf_train_w2v=[]

expected_tfidf_train_w2v = y_train_tfidf_w2v.values

for i in range(0,x_train_tfidf_w2v.shape[0]):
    predicted_tfidf_train_w2v.extend((final_GBDT_tfidf_w2v.predict_proba(x_train_tfidf_w2v_csr[i])[:,1]>=0.642).astype(bool))

predicted_tfidf_test_w2v=[]

expected_tfidf_test_w2v = y_test_tfidf_w2v.values

for i in range(0,x_test_tfidf_w2v.shape[0]):
    predicted_tfidf_test_w2v.extend((final_GBDT_tfidf_w2v.predict_proba(x_test_tfidf_w2v_csr[i])[:,1]>=0.642).astype(bool))
In [0]:
plt.subplots(figsize=(15,4))
plt.subplot(1,2,1)
cmdf_train=cf_mx(expected_tfidf_train_w2v, predicted_tfidf_train_w2v)
df_cm_train = pd.DataFrame(cmdf_train, range(2),range(2))
df_cm_train.columns = ['Predicted: NO','Predicted: YES']
df_cm_train = df_cm_train.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_train, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for train data using TFIDF W2V')

plt.subplot(1,2,2)
cmdf_test=cf_mx(expected_tfidf_test_w2v, predicted_tfidf_test_w2v)
df_cm_test = pd.DataFrame(cmdf_test, range(2),range(2))
df_cm_test.columns = ['Predicted: NO','Predicted: YES']
df_cm_test = df_cm_test.rename({0: 'Actual: NO', 1: 'Actual: YES'})
sns.heatmap(df_cm_test, annot=True,annot_kws={"size": 16}, fmt='g')
plt.title('Confusion matrix for test data using TFIDF W2V')
plt.subplots_adjust(wspace=0.5)
plt.show()
plt.close()

3. Conclusions

3.1 Random Forest Results

In [0]:
#http://zetcode.com/python/prettytable/

from prettytable import PrettyTable

print()

x = PrettyTable()

x.field_names = ["Vectorizer", "Model", "Hyper parameter(n_estimators, max_depth)", "AUC(Train Data)", "AUC(Test Data)"]

x.add_row(["BoW", "Brute", "50, 50", 1, 0.709])
x.add_row(["TFIDF", "Brute", "50, 50", 1, 0.698])
x.add_row(["W2V", "Brute", "10, 50", 0.831, 0.664])
x.add_row(["TFIDF AVG W2V", "Brute", "10, 50", 0.830, 0.698])

print(x)
+---------------+-------+------------------------------------------+-----------------+----------------+
|   Vectorizer  | Model | Hyper parameter(n_estimators, max_depth) | AUC(Train Data) | AUC(Test Data) |
+---------------+-------+------------------------------------------+-----------------+----------------+
|      BoW      | Brute |                  50, 50                  |        1        |     0.709      |
|     TFIDF     | Brute |                  50, 50                  |        1        |     0.698      |
|      W2V      | Brute |                  10, 50                  |      0.831      |     0.664      |
| TFIDF AVG W2V | Brute |                  10, 50                  |       0.83      |     0.698      |
+---------------+-------+------------------------------------------+-----------------+----------------+

3.1 GBDT Results

In [0]:
x = PrettyTable()

x.field_names = ["Vectorizer", "Model", "Hyper parameter(n_estimators)", "AUC(Train Data)", "AUC(Test Data)"]

x.add_row(["BoW", "Brute", "500", 0.843, 0.754])
x.add_row(["TFIDF", "Brute", "500", 0.860, 0.747])
x.add_row(["W2V", "Brute", "200", 0.807, 0.724])
x.add_row(["TFIDF AVG W2V", "Brute", "50", 0.762, 0.739])

print(x)
+---------------+-------+-------------------------------+-----------------+----------------+
|   Vectorizer  | Model | Hyper parameter(n_estimators) | AUC(Train Data) | AUC(Test Data) |
+---------------+-------+-------------------------------+-----------------+----------------+
|      BoW      | Brute |              500              |      0.843      |     0.754      |
|     TFIDF     | Brute |              500              |       0.86      |     0.747      |
|      W2V      | Brute |              200              |      0.807      |     0.724      |
| TFIDF AVG W2V | Brute |               50              |      0.762      |     0.739      |
+---------------+-------+-------------------------------+-----------------+----------------+
  • GBDT has produced better results both on train and test data compared to Random Forests.
  • The training time is significantly high for GBDT copared to Random Forets.